
Correlation and regression analysis in Excel: instructions for execution. Parabolic and polynomial regression

The following data are available for different countries on the retail food price index (x) and the industrial production index (y).

No.  Retail food price index (x)  Industrial production index (y)
1 100 70
2 105 79
3 108 85
4 113 84
5 118 85
6 118 85
7 110 96
8 115 99
9 119 100
10 118 98
11 120 99
12 124 102
13 129 105
14 132 112

Required:

1. To characterize the dependence of y on x, calculate the parameters of the following functions:

a) linear;

b) power;

c) equilateral hyperbola.

3. Assess the statistical significance of regression and correlation parameters.

4. Make a forecast of the value of the industrial production index y with the forecast value of the retail food price index x=138.

Solution:

1. To calculate the linear regression parameters, we solve the system of normal equations for a and b:

Let's build a table of calculated data, as shown in Table 1.

Table 1. Calculated data for estimating the linear regression

No.  x  y  xy  x²  y²  ŷ  |y − ŷ|/y
1 100 70 7000 10000 4900 74,26340 0,060906
2 105 79 8295 11025 6241 79,92527 0,011712
3 108 85 9180 11664 7225 83,32238 0,019737
4 113 84 9492 12769 7056 88,98425 0,059336
5 118 85 10030 13924 7225 94,64611 0,113484
6 118 85 10030 13924 7225 94,64611 0,113484
7 110 96 10560 12100 9216 85,58713 0,108467
8 115 99 11385 13225 9801 91,24900 0,078293
9 119 100 11900 14161 10000 95,77849 0,042215
10 118 98 11564 13924 9604 94,64611 0,034223
11 120 99 11880 14400 9801 96,91086 0,021102
12 124 102 12648 15376 10404 101,4404 0,005487
13 129 105 13545 16641 11025 107,1022 0,020021
14 132 112 14784 17424 12544 110,4993 0,013399
Total: 1629 1299 152293 190557 122267 1299,001 0,701866
Average value: 116,3571 92,78571 10878,07 13611,21 8733,357 X X
σ: 8,4988 11,1431 X X X X
σ²: 72,23 124,17 X X X X X

The average value is determined by the formula:

We calculate the standard deviation using the formula:

and enter the result in Table 1.

By squaring the resulting value we get the variance:

The parameters of the equation can also be determined using the formulas:

So the regression equation is:

Therefore, with an increase in the retail food price index by 1, the industrial production index increases on average by 1.13.

Let's calculate the linear pair correlation coefficient:

The connection is direct and quite close.

Let's determine the coefficient of determination:

The variation in the result is 74.59% explained by the variation in the x factor.

Substituting the actual values of x into the regression equation, we determine the theoretical (calculated) values.

Therefore, the parameters of the equation are determined correctly.

Let's calculate the average approximation error, i.e. the average deviation of the calculated values from the actual ones:

On average, the calculated values deviate from the actual ones by 5.01%.
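As a cross-check of the figures above, the parameters, the correlation coefficient, and the approximation error can be recomputed directly from the raw data of Table 1; a minimal sketch (variable names are illustrative):

```python
# Raw data from the problem statement (Table 1)
x = [100, 105, 108, 113, 118, 118, 110, 115, 119, 118, 120, 124, 129, 132]
y = [70, 79, 85, 84, 85, 85, 96, 99, 100, 98, 99, 102, 105, 112]
n = len(x)

sx, sy = sum(x), sum(y)
sxy = sum(xi * yi for xi, yi in zip(x, y))
sxx = sum(xi * xi for xi in x)
syy = sum(yi * yi for yi in y)

# Least-squares parameters of y = a + b*x
b = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
a = sy / n - b * sx / n

# Pair correlation coefficient and coefficient of determination
r = (n * sxy - sx * sy) / ((n * sxx - sx ** 2) * (n * syy - sy ** 2)) ** 0.5
r2 = r ** 2

# Average approximation error, % (mean relative deviation of fitted values)
A = 100 / n * sum(abs(yi - (a + b * xi)) / yi for xi, yi in zip(x, y))

# Point forecast for x = 138
y138 = a + b * 138
```

Running this reproduces b ≈ 1.13, r² ≈ 0.7459, the 5.01% approximation error, and a forecast of about 117.3 for x = 138.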

We will assess the quality of the regression equation using the F-test.

The F-test consists of testing the hypothesis H 0 about the statistical insignificance of the regression equation and of the indicator of the closeness of the relationship. To do this, the actual value F fact is compared with the critical (tabulated) value F table of Fisher's F-criterion.

F fact is determined by the formula:

where n is the number of population units;

m is the number of parameters for variables x.
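In code, F fact for the simple linear case (m = 1) reduces to one line; a sketch using the r² found above and the tabulated critical value F(0.05; 1; 12) = 4.75 from Appendix 1:

```python
n, m = 14, 1      # number of observations; number of parameters for x
r2 = 0.7459       # coefficient of determination found earlier

# F_fact = (r^2 / (1 - r^2)) * (n - m - 1) / m
f_fact = r2 / (1 - r2) * (n - m - 1) / m

f_table = 4.75    # critical value, alpha = 0.05, df = (1, 12)
significant = f_fact > f_table   # H0 (insignificance) is rejected if True
```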

The obtained estimates of the regression equation allow it to be used for forecasting.

If the forecast value of the retail food price index is x = 138, then the forecast value of the industrial production index will be:

2. Power regression has the form:

To determine the parameters, the power function is linearized by taking logarithms:

To determine the parameters of the logarithmic function, a system of normal equations is constructed using the least squares method:

Let's build a table of calculated data, as shown in Table 2.

Table 2. Calculated data for estimating the power regression

No.  x  y  lg x  lg y  lg x·lg y  (lg x)²  (lg y)²
1 100 70 2,000000 1,845098 3,690196 4,000000 3,404387
2 105 79 2,021189 1,897627 3,835464 4,085206 3,600989
3 108 85 2,033424 1,929419 3,923326 4,134812 3,722657
4 113 84 2,053078 1,924279 3,950696 4,215131 3,702851
5 118 85 2,071882 1,929419 3,997528 4,292695 3,722657
6 118 85 2,071882 1,929419 3,997528 4,292695 3,722657
7 110 96 2,041393 1,982271 4,046594 4,167284 3,929399
8 115 99 2,060698 1,995635 4,112401 4,246476 3,982560
9 119 100 2,075547 2,000000 4,151094 4,307895 4,000000
10 118 98 2,071882 1,991226 4,125585 4,292695 3,964981
11 120 99 2,079181 1,995635 4,149287 4,322995 3,982560
12 124 102 2,093422 2,008600 4,204847 4,382414 4,034475
13 129 105 2,110590 2,021189 4,265901 4,454589 4,085206
14 132 112 2,120574 2,049218 4,345518 4,496834 4,199295
Total 1629 1299 28,90474 27,49904 56,79597 59,69172 54,05467
Average value 116,3571 92,78571 2,064624 1,964217 4,056855 4,263694 3,861048
σ: 8,4988 11,1431 0,031945 0,053853 X X X
σ²: 72,23 124,17 0,001021 0,0029 X X X

Table 2 (continued). Calculated data for estimating the power regression

No.  x  y  ŷ  (y − ŷ)²  |y − ŷ|/y  (y − ȳ)²
1 100 70 74,16448 17,34292 0,059493 519,1886
2 105 79 79,62057 0,385112 0,007855 190,0458
3 108 85 82,95180 4,195133 0,024096 60,61728
4 113 84 88,59768 21,13866 0,054734 77,1887
5 118 85 94,35840 87,57961 0,110099 60,61728
6 118 85 94,35840 87,57961 0,110099 60,61728
7 110 96 85,19619 116,7223 0,11254 10,33166
8 115 99 90,88834 65,79901 0,081936 38,6174
9 119 100 95,52408 20,03384 0,044759 52,04598
10 118 98 94,35840 13,26127 0,037159 27,18882
11 120 99 96,69423 5,316563 0,023291 38,6174
12 124 102 101,4191 0,337467 0,005695 84,90314
13 129 105 107,4232 5,872099 0,023078 149,1889
14 132 112 111,0772 0,85163 0,00824 369,1889
Total 1629 1299 1296,632 446,4152 0,703074 1738,357
Average value 116,3571 92,78571 X X X X
σ: 8,4988 11,1431 X X X X
σ²: 72,23 124,17 X X X X

By solving a system of normal equations, we determine the parameters of the logarithmic function.

We get a linear equation:

Potentiating it (taking antilogarithms), we obtain:

Substituting the actual values of x into this equation, we obtain the theoretical values of the result. Based on them, we calculate the indicators of the closeness of the connection (the correlation index) and the average approximation error.

The connection is quite close.

On average, the calculated values ​​deviate from the actual ones by 5.02%.
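The power fit can be reproduced by running the same least-squares machinery on base-10 logarithms and then potentiating; a minimal sketch (the coefficient values are computed here, since the worked figures are not reproduced in the text, and the fitted value at x = 100 can be compared with the ŷ column of Table 2):

```python
from math import log10

x = [100, 105, 108, 113, 118, 118, 110, 115, 119, 118, 120, 124, 129, 132]
y = [70, 79, 85, 84, 85, 85, 96, 99, 100, 98, 99, 102, 105, 112]
n = len(x)

lx = [log10(v) for v in x]
ly = [log10(v) for v in y]

sx, sy = sum(lx), sum(ly)
sxy = sum(a * b for a, b in zip(lx, ly))
sxx = sum(a * a for a in lx)

# Linear fit in log coordinates: lg y = lg a + b * lg x
b = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
a = 10 ** (sy / n - b * sx / n)   # potentiation

y100 = a * 100 ** b   # fitted value at x = 100 (compare with Table 2)
y138 = a * 138 ** b   # forecast for x = 138
```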

Thus, the hypothesis H 0 about the random nature of the estimated characteristics is rejected, and their statistical significance and reliability are recognized.

The obtained estimates of the regression equation allow it to be used for forecasting. If the forecast value of the retail food price index is x = 138, then the forecast value of the industrial production index will be:

3. The equilateral hyperbola has the form y = a + b/x. To determine the parameters of this equation, a system of normal equations is used:

Let us make the change of variables z = 1/x and obtain the following system of normal equations:

By solving a system of normal equations, we determine the parameters of the hyperbola.

Let's create a table of calculated data, as shown in Table 3.

Table 3. Calculated data for assessing the hyperbolic dependence

No.  x  y  z = 1/x  zy  z²  y²
1 100 70 0,010000000 0,700000 0,0001000 4900
2 105 79 0,009523810 0,752381 0,0000907 6241
3 108 85 0,009259259 0,787037 0,0000857 7225
4 113 84 0,008849558 0,743363 0,0000783 7056
5 118 85 0,008474576 0,720339 0,0000718 7225
6 118 85 0,008474576 0,720339 0,0000718 7225
7 110 96 0,009090909 0,872727 0,0000826 9216
8 115 99 0,008695652 0,860870 0,0000756 9801
9 119 100 0,008403361 0,840336 0,0000706 10000
10 118 98 0,008474576 0,830508 0,0000718 9604
11 120 99 0,008333333 0,825000 0,0000694 9801
12 124 102 0,008064516 0,822581 0,0000650 10404
13 129 105 0,007751938 0,813953 0,0000601 11025
14 132 112 0,007575758 0,848485 0,0000574 12544
Total: 1629 1299 0,120971823 11,13792 0,0010510 122267
Average value: 116,3571 92,78571 0,008640844 0,795566 0,0000751 8733,357
σ: 8,4988 11,1431 0,000640820 X X X
σ²: 72,23 124,17 0,000000411 X X X
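Since the worked parameter values for the hyperbola are not shown in the text, they can be recovered from the raw data with the substitution z = 1/x; a minimal sketch:

```python
x = [100, 105, 108, 113, 118, 118, 110, 115, 119, 118, 120, 124, 129, 132]
y = [70, 79, 85, 84, 85, 85, 96, 99, 100, 98, 99, 102, 105, 112]
n = len(x)
z = [1 / v for v in x]   # change of variables for y = a + b/x

sz, sy = sum(z), sum(y)
szy = sum(zi * yi for zi, yi in zip(z, y))
szz = sum(zi * zi for zi in z)

# Least squares for the linearized model y = a + b*z
b = (n * szy - sz * sy) / (n * szz - sz ** 2)
a = sy / n - b * sz / n

y138 = a + b / 138   # forecast for x = 138
```

Note that b comes out negative: y falls as z = 1/x grows, i.e. y still rises with x, consistent with the direct connection found for the other two functions.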

Continuation of Table 3 Calculated data for assessing the hyperbolic dependence

Another type of one-factor regression is approximation by power polynomials of the form:

It is natural to want to obtain the simplest possible dependence, limiting ourselves to power polynomials of the second degree, i.e. parabolic dependence:
(5.5.2)

Let's calculate the partial derivatives with respect to the coefficients b 0 , b 1 and b 2 :



(5.5.3)

Equating the derivatives to zero, we obtain the system of normal equations:

(5.5.4)

Solving the system of normal equations (5.5.4) for a specific set of values x i , y i , we obtain the optimal values of b 0 , b 1 and b 2 . For approximation by dependence (5.5.2), and even more so by (5.5.1), simple formulas for calculating the coefficients have not been obtained; as a rule, they are computed by standard procedures in matrix form:

(5.5.5)
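A sketch of this procedure: assemble the sums of the normal system (5.5.4) and solve the resulting 3×3 system by Gaussian elimination. The data below are illustrative (generated from a known parabola so the recovered coefficients can be checked against the true values), not taken from the text:

```python
def fit_parabola(xs, ys):
    """Solve the normal equations for y = b0 + b1*x + b2*x^2."""
    s = [sum(x ** k for x in xs) for k in range(5)]                   # sums of x^0 .. x^4
    t = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]   # sums of y*x^k
    # Augmented matrix of the normal system
    m = [[s[0], s[1], s[2], t[0]],
         [s[1], s[2], s[3], t[1]],
         [s[2], s[3], s[4], t[2]]]
    for i in range(3):                       # forward elimination with partial pivoting
        p = max(range(i, 3), key=lambda r: abs(m[r][i]))
        m[i], m[p] = m[p], m[i]
        for r in range(i + 1, 3):
            f = m[r][i] / m[i][i]
            for c in range(i, 4):
                m[r][c] -= f * m[i][c]
    b = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):                      # back substitution
        b[i] = (m[i][3] - sum(m[i][c] * b[c] for c in range(i + 1, 3))) / m[i][i]
    return b

# Illustrative data lying exactly on y = 1 + 0.5*x + 2*x^2
xs = [1, 2, 3, 4, 5]
ys = [1 + 0.5 * x + 2 * x * x for x in xs]
b0, b1, b2 = fit_parabola(xs, ys)
```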

Figure 5.5.1 shows a typical example of approximation by a parabolic dependence, with experimental points from (1; 1) to (5; 9).

Fig. 5.5.1. Coordinates of the experimental points and the parabolic dependence approximating them

Example 5.1. Approximate the experimental results given in Table 5.1.1 with a linear regression equation.

Table 5.1.1

Let's construct experimental points according to the coordinates indicated in Table 5.1.1 on the graph presented in Fig. 5.1.1.

Fig. 5.1.1. Experimental points plotted from the coordinates in Table 5.1.1

According to Fig. 5.1.1, on which we draw a straight line for a preliminary assessment, we conclude that there is a clearly expressed nonlinearity in the location of the experimental points, but it is not very significant, so it makes sense to approximate them with a linear dependence. Note that to obtain a correct mathematical conclusion, the straight line must be constructed by the least squares method.

Before performing regression analysis, it is advisable to calculate the linear correlation coefficient between the variables x and y:

The significance of the correlation relationship is determined by the critical value of the linear correlation coefficient, calculated using the formula:

The critical value of Student's test, t crit , is found from statistical tables for the recommended significance level α = 0.05 and n − 2 degrees of freedom. If the calculated value r xy is not less than the critical value r crit , the correlation between the variables x and y is considered significant. Let's do the calculations:










Due to the fact that
we conclude that the correlation between the variables x and y is significant, and it can be taken as linear.
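The critical value used in this check follows from the tabulated Student's t; a sketch assuming the five experimental points x = 1 … 5 of Table 5.1.1 (as the 3 degrees of freedom cited in this example imply), with t crit = 3.182 for α = 0.05, two-sided:

```python
n = 5            # experimental points, so df = n - 2 = 3
t_crit = 3.182   # Student's t, alpha = 0.05 (two-sided), df = 3

# Critical value of the linear correlation coefficient
r_crit = t_crit / (t_crit ** 2 + n - 2) ** 0.5
```

Any computed r xy at or above this r crit (about 0.88) is treated as a significant correlation.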

Let's calculate the coefficients of the regression equation:

Thus, we obtained a linear regression equation:

Using the regression equation, we draw a straight line in Fig. 5.1.2.

Fig. 5.1.2. Coordinates of the experimental points and the linear dependence approximating them (the regression line passes through (0; −0.2) and (5; 9.8))

Using the regression equation, we calculate the values of the function at the experimental points of Table 5.1.1 and the differences between the experimental and calculated values of the function, which we present in Table 5.1.2.

Table 5.1.2


Let's calculate the mean square error and its ratio to the average value:

In terms of the ratio of the standard error to the mean value, an unsatisfactory result was obtained, since the recommended value of 0.05 was exceeded.

Let's evaluate the significance level of the regression equation coefficients using the Student's t-test:


From the statistical table for 3 degrees of freedom, we write the rows with the significance levels and the corresponding values of the Student's criterion t into Table 5.1.3.

Table 5.1.3

Significance level of regression equation coefficients:


Note that, by significance level, a satisfactory result was obtained for one coefficient and an unsatisfactory one for the other.

Let us evaluate the quality of the resulting regression equation using indicators calculated on the basis of analysis of variance:

Examination:

The result of the check is positive, which indicates the correctness of the calculations performed.

Let's calculate the Fisher criterion:

with two degrees of freedom:

Using statistical tables, we find the critical values of the Fisher criterion for the two recommended gradations of the significance level:


Since the calculated value of the Fisher test exceeds the critical value for the significance level of 0.01, we will assume that the significance level according to the Fisher test is less than 0.01, which will be considered satisfactory.

Let's calculate the coefficient of multiple determination:

for two degrees of freedom

Using the statistical table for the recommended significance level of 0.05 and two degrees of freedom found, we find the critical value of the coefficient of multiple determination:

Since the calculated value of the coefficient of multiple determination exceeds the critical value for the significance level of 0.05, the significance level by this indicator is below 0.05, and the result obtained for it will be considered satisfactory.

Thus, the obtained calculated parameters in terms of the ratio of the standard error to the mean value and the level of significance according to the Student’s test are unsatisfactory, therefore it is advisable to select another approximating dependence for approximation.

Example 5.2. Approximation of the experimental distribution of random numbers by a mathematical dependence

The experimental distribution of random numbers given in Table 5.1.1, when approximated by a linear dependence, did not lead to a satisfactory result, in part because of the insignificance of the free-term coefficient of the regression equation. Therefore, to improve the quality of the approximation, we will try a linear dependence without a free term:

Let's calculate the value of the coefficient of the regression equation:

Thus, we obtained the regression equation:

Using the resulting regression equation, we calculate the values of the function and the differences between the experimental and calculated values of the function, which we present in Table 5.2.1.

Table 5.2.1

Using the regression equation, we draw a straight line in Fig. 5.2.1.

Fig. 5.2.1. Coordinates of the experimental points and the linear dependence approximating them (the line passes through (0; 0) and (5; 9.73))

To assess the quality of the approximation, we will carry out calculations of quality indicators similar to the calculations given in example 5.1.

(remains old);

with 4 degrees of freedom;

For

Based on the results of the approximation, we note that in terms of the significance level of the coefficient of the regression equation, a satisfactory result was obtained; The ratio of the standard error to the mean has improved, but is still above the recommended value of 0.05, so it is recommended to repeat the approximation with a more complex mathematical relationship.

Example 5.3. To improve the quality of approximation of examples 5.1 and 5.2, we will carry out a nonlinear approximation by the dependence
. To do this, we will first make intermediate calculations and place their results in table 5.3.1.

Table 5.3.1. Values of X², (ln X)² and ln X · ln Y

Let's additionally calculate:

Let us approximate the dependence. Using formulas (5.3.7) and (5.3.8), we calculate the coefficients b 0 and b 1 :

Using formulas (5.3.11), we calculate the coefficients A 0 and A 1 :


To calculate the standard error, intermediate calculations were carried out, presented in Table 5.3.2.

Table 5.3.2

Columns: Y i , y i . Sum: 7.5968

The standard error of approximation turned out to be much larger than in the two previous examples, so we consider the approximation results unusable.

Example 5.4. Let's try to approximate with another nonlinear dependence. Using formulas (5.3.9) and (5.3.10) and the data of Table 5.3.1, we calculate the coefficients b 0 and b 1 :

We got an intermediate dependence:

Using formulas (5.3.13) we calculate the coefficients C 0 and C 1 :


We got the final dependency:

To calculate the standard error, we will carry out intermediate calculations and place them in table 5.4.1.

Table 5.4.1

Columns: Y i , y i . Sum: 21.83152

Let's calculate the standard error:

The standard error of approximation turned out to be much larger than in the previous example, so we consider the approximation results unusable.

Example 5.5. Approximation of the experimental distribution of random numbers by a mathematical dependence y = b · lnx

The initial data, as in the previous examples, are shown in Table 5.4.1 and Fig. 5.4.1.

Table 5.4.1

Based on the analysis of Fig. 5.4.1 and Table 5.4.1, we note that the function changes more at smaller values of the argument (at the beginning of the table) than at larger ones (at the end). It therefore seems advisable to change the scale of the argument by introducing its logarithm into the regression equation and to approximate with the following mathematical dependence:

. Using formula (5.4.3) we calculate the coefficient b:

To assess the quality of the approximation, we will carry out intermediate calculations presented in Table 5.4.2, from which we will calculate the magnitude of the error and the ratio of the standard error to the average value.

Table 5.4.2


Since the ratio of the standard error to the mean value exceeds the recommended value of 0.05, the result is considered unsatisfactory. In particular, note that the greatest deviation occurs at x = 1, since at this value ln x = 0. Therefore, we will approximate with the dependence y = b 0 + b 1 ln x

We present auxiliary calculations in the form of table 5.4.3.

Table 5.4.3

Using formulas (5.4.6) and (5.4.7) we calculate the coefficients b 0 and b 1 :

The fitted logarithmic curve passes through the points (1; 0.93) and (5; 9.12).

To assess the quality of the approximation, we will carry out auxiliary calculations and determine the level of significance of the found coefficients and the ratio of the standard error to the average value.

The significance level is slightly above the recommended value of 0.05.

Although the main indicator, the ratio of the standard error to the mean value, exceeds the recommended level of 0.05 almost twofold, we will consider the results acceptable. Note that the calculated value of the Student's test t b 0 = 2.922 differs from the critical value by a relatively small amount.

Example 5.6. Let us approximate the experimental data of Example 5.1 by a hyperbolic dependence. To calculate the coefficients b 0 and b 1 , we carry out the preliminary calculations given in Table 5.6.1.

Table 5.6.1

Columns: X i ; x i = 1/X i ; x i ²; x i y i

Based on the results of Table 5.6.1 using formulas (5.4.8) and (5.4.9), we calculate the coefficients b 0 and b 1 :

Thus, a hyperbolic regression equation is obtained

.

The results of auxiliary calculations for assessing the quality of approximation are given in Table 5.6.2.

Table 5.6.2


Based on the results of Table 5.6.2, we calculate the standard error and the ratio of the standard error to the mean value:


Due to the fact that the ratio of the standard error to the mean value exceeds the recommended value of 0.05, we conclude that the approximation results are unsuitable.

Example 5.7.

To calculate specific values ​​of income from the operation of jib cranes depending on the time of maintenance work, it is necessary to obtain a parabolic dependence.

Let us calculate the coefficients of this dependence b 0 , b 1 , b 11 in matrix form according to the formula:

Nonlinear regression equations connecting the effective indicator with the optimal values for carrying out preventive maintenance of tower cranes were obtained using the multiple regression procedure of the Statistica 6.0 application package. Next, we present the results of the regression analysis for the effective performance indicator according to Table 5.7.1.

Table 5.7.1

Table 5.7.2 shows the results of nonlinear regression for the effective performance indicator and Table 5.7.3 shows the results of the analysis of residuals.

Table 5.7.2

Table 5.7.3

Fig. 3.7.36. Residual analysis.

Thus, we obtained a multiple regression equation for the variable
:

Ratio of standard error to mean:

14780/1017890=0,0145 < 0,05.

Since the ratio of the standard error to the mean value does not exceed the recommended value of 0.05, the approximation results can be considered acceptable. As a drawback according to Table 5.7.2, it should be noted that all calculated coefficients exceed the recommended significance level of 0.05.

Let's consider constructing a regression equation of the form .

The system of normal equations for finding the parabolic regression coefficients is compiled in the same way as the normal equations of linear regression.

After transformations we get:

.

By solving a system of normal equations, the coefficients of the regression equation are obtained.

,

where … , and … .

The second degree equation describes the experimental data significantly better than the first degree equation if the decrease in variance compared to the linear regression variance is significant (non-random). The significance of the difference between and is assessed by the Fisher criterion:

where the number is taken from reference statistical tables (Appendix 1) according to the degrees of freedom and the selected significance level.
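As a sketch of this check, using the residual variances reported for the water-vapour example later in this text (D²1ost = 0.0085 for the straight line, D²2ost = 0.0002 for the parabola, with a tabulated value of 4.39):

```python
d1_ost = 0.0085   # residual variance of the first-degree (linear) equation
d2_ost = 0.0002   # residual variance of the second-degree (parabolic) equation

f = d1_ost / d2_ost   # Fisher criterion for the significance of the variance reduction
f_table = 4.39        # tabulated value for the chosen significance level

adequate = f > f_table   # the quadratic term is justified if True
```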

The procedure for performing calculation work:

1. Familiarize yourself with the theoretical material set out in the guidelines or in additional literature.

2. Calculate the coefficients of the linear regression equation. To do this, you need to compute the sums; it is convenient to compute at once the sums that will also be needed for the coefficients of the parabolic equation.

3. Compute the calculated values of the output parameter using the equation.

4. Calculate the total and residual variances, as well as Fisher's criterion.

where – is the matrix whose elements are the coefficients of the system of normal equations;

– is the vector whose elements are the unknown coefficients;

– is the matrix of the right-hand sides of the system of equations.

7. Compute the calculated values of the output parameter using the equation.

8. Calculate the residual variance, as well as Fisher's criterion.



9. Draw conclusions.

10. Construct graphs of regression equations and initial data.

11. Complete settlement work.

Calculation example.

Using experimental data on the dependence of water vapor density on temperature, obtain regression equations of the form and . Conduct statistical analysis and draw a conclusion about the best empirical relationship.

0,0512 0,0687 0,081 0,1546 0,2516 0,3943 0,5977 0,8795

Processing of experimental data was carried out in accordance with the recommendations for the work. Calculations to determine the parameters of the linear equation are given in Table 1.

Table 1 - Finding the parameters of a linear dependence of the form ρ = b 0 + b 1 ∙t
Water vapor density at the saturation line
ρ i | t i ρ i | ρ(t i ) calc | Δρ i = ρ(t i ) calc − ρ i | (Δρ i )² | (ρ i − ρ̄)²
0,0512 2,05 -0,0403 -0,0915 0,0084 0,0669
0,0687 3,16 0,0248 -0,0439 0,0019 0,0582
0,0811 4,22 0,0899 0,0089 0,0001 0,0523
0,1546 9,9 0,2202 0,06565 0,0043 0,0241
0,2516 19,12 0,3505 0,09894 0,0098 0,0034
0,3943 34,70 0,4808 0,08654 0,0075 0,0071
0,5977 59,77 0,6111 0,01344 0,0002 0,0829
0,8795 98,50 0,7414 -0,13807 0,0191 0,3245
Sum: Σρ i = 2,4786; Σt i ρ i = 231,41; Σ(Δρ i )² = 0,0512; Σ(ρ i − ρ̄)² = 0,6194
Average: t̄ = 72,25; ρ̄ = 0,3098; mean t i ² = 5822,5; mean t i ρ i = 28,93
b 0 = -0,4747; D 1 ost ² = 0,0085
b 1 = 0,0109; D y ² = 0,0885
F = 10,368
F T = 3,87; F > F T , the model is adequate

.
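The parameters b 0 and b 1 in Table 1 can be verified directly. The temperature column is not reproduced in the text, so the t i values below are an assumption, recovered by dividing the t i ρ i column by ρ i :

```python
# Density of water vapour at the saturation line (problem data)
rho = [0.0512, 0.0687, 0.0811, 0.1546, 0.2516, 0.3943, 0.5977, 0.8795]
# Temperatures in degrees C (assumption: recovered from the t*rho column)
t = [40, 46, 52, 64, 76, 88, 100, 112]
n = len(t)

st, sr = sum(t), sum(rho)
strho = sum(ti * ri for ti, ri in zip(t, rho))
stt = sum(ti * ti for ti in t)

# Least squares for rho = b0 + b1*t
b1 = (n * strho - st * sr) / (n * stt - st ** 2)
b0 = sr / n - b1 * st / n
```

The recovered sums match the table (Σtρ = 231.41, mean t² = 5822.5), and the fit reproduces b 0 ≈ -0.4747 and b 1 ≈ 0.0109.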

To determine the parameters of parabolic regression, the elements of the coefficient matrix and the matrix of the right sides of the system of normal equations were first determined. Then the coefficients were calculated in the MathCad environment:

The calculation data are given in Table 2.

Designations in table 2:

.

Conclusions

The parabolic equation describes the experimental data on the dependence of vapor density on temperature significantly better, since the calculated value of the Fisher criterion (42.5) substantially exceeds the table value of 4.39. Therefore, including a quadratic term in the polynomial equation makes sense.

The results obtained are presented in graphical form (Fig. 3).

Figure 3 – Graphic interpretation of calculation results.

The dotted line is the linear regression equation; solid line – parabolic regression, points on the graph – experimental values.

Table 2. – Finding the parameters of the dependence y(t) = a 0 + a 1 ∙x + a 2 ∙x², water vapor density on the saturation line ρ = a 0 + a 1 ∙t + a 2 ∙t², points i = 1 … 8:

t i , °C: 40 46 52 64 76 88 100 112
ρ i : 0,0512 0,0687 0,0811 0,1546 0,2516 0,3943 0,5977 0,8795; sum 2,4786; average 0,3098
t i ρ i : 2,05 3,16 4,22 9,89 19,12 34,70 59,77 98,50; sum 231,41
t i ²ρ i : 81,84 145,33 219,21 633,24 1453,2 3053,4 5977,00 11032,45; sum 22595,77
ρ(t i ) calc : 0,0631 0,0642 0,0773 0,1394 0,2493 0,4070 0,6126 0,8660; sum 2,4788
Δρ i = ρ(t i ) calc – ρ i : 0,01194 –0,00446 –0,00377 –0,01524 –0,00235 0,01270 0,011489 –0,01348
(Δρ i )²: 0,0001 0,0000 0,0000 0,0002 0,0000 0,0002 0,0002 0,0002; sum 0,0010
(ρ i – ρ̄)²: 0,0669 0,0582 0,0523 0,0241 0,0034 0,0071 0,0829 0,3245; sum 0,6194
D 1 ² ost = 0,0085; D 2 ² ost = 0,0002; D 1 ² y = 0,0885; F = 42,5
a 0 = 0,36129; a 1 = –0,0141; a 2 = 1,6613E-04

Appendix 1

Fisher distribution table for q = 0,05

f 1 (columns): 1, 2, 3, 4, 5, 6, 8, 12, 24, ∞
f 2 (rows): 1–30, 40, 60, 120
161,40 199,50 215,70 224,60 230,20 234,00 238,90 243,90 249,00 254,30
18,51 19,00 19,16 19,25 19,30 19,33 19,37 19,41 19,45 19,50
10,13 9,55 9,28 9,12 9,01 8,94 8,84 8,74 8,64 8,53
7,71 6,94 6,59 6,39 6,26 6,16 6,04 5,91 5,77 5,63
6,61 5,79 5,41 5,19 5,05 4,95 4,82 4,68 4,53 4,36
5,99 5,14 4,76 4,53 4,39 4,28 4,15 4,00 3,84 3,67
5,59 4,74 4,35 4,12 3,97 3,87 3,73 3,57 3,41 3,23
5,32 4,46 4,07 3,84 3,69 3,58 3,44 3,28 3,12 2,93
5,12 4,26 3,86 3,63 3,48 3,37 3,24 3,07 2,90 2,71
4,96 4,10 3,71 3,48 3,33 3,22 3,07 2,91 2,74 2,54
4,84 3,98 3,59 3,36 3,20 3,09 2,95 2,79 2,61 2,40
4,75 3,88 3,49 3,26 3,11 3,00 2,85 2,69 2,50 2,30
4,67 3,80 3,41 3,18 3,02 2,92 2,77 2,60 2,42 2,21
4,60 3,74 3,34 3,11 2,96 2,85 2,70 2,53 2,35 2,13
4,54 3,68 3,29 3,06 2,90 2,79 2,64 2,48 2,29 2,07
4,49 3,63 3,24 3,01 2,82 2,74 2,59 2,42 2,24 2,01
4,45 3,59 3,20 2,96 2,81 2,70 2,55 2,38 2,19 1,96
4,41 3,55 3,16 2,93 2,77 2,66 2,51 2,34 2,15 1,92
4,38 3,52 3,13 2,90 2,74 2,63 2,48 2,31 2,11 1,88
4,35 3,49 3,10 2,87 2,71 2,60 2,45 2,28 2,08 1,84
4,32 3,47 3,07 2,84 2,68 2,57 2,42 2,25 2,05 1,81
4,30 3,44 3,05 2,82 2,66 2,55 2,40 2,23 2,03 1,78
4,28 3,42 3,03 2,80 2,64 2,53 2,38 2,20 2,00 1,76
4,26 3,40 3,01 2,78 2,62 2,51 2,36 2,18 1,98 1,73
4,24 3,38 2,99 2,76 2,60 2,49 2,34 2,16 1,96 1,71
4,22 3,37 2,98 2,74 2,59 2,47 2,32 2,15 1,95 1,69
4,21 3,35 2,96 2,73 2,57 2,46 2,30 2,13 1,93 1,67
4,20 3,34 2,95 2,71 2,56 2,44 2,29 2,12 1,91 1,65
4,18 3,33 2,93 2,70 2,54 2,43 2,28 2,10 1,90 1,64
4,17 3,32 2,92 2,69 2,53 2,42 2,27 2,09 1,89 1,62
4,08 3,23 2,84 2,61 2,45 2,34 2,18 2,00 1,79 1,52
4,00 3,15 2,76 2,52 2,37 2,25 2,10 1,92 1,70 1,39
3,92 3,07 2,68 2,45 2,29 2,17 2,02 1,88 1,61 1,25

Regression and correlation analysis are statistical research methods. These are the most common ways to show the dependence of a parameter on one or more independent variables.

Below, using specific practical examples, we will consider these two very popular analyzes among economists. We will also give an example of obtaining results when combining them.

Regression Analysis in Excel

It shows the influence of some values (the independent, explanatory variables) on the dependent variable. For example: how does the number of the economically active population depend on the number of enterprises, wages and other parameters? Or: how do foreign investments, energy prices, etc. affect the level of GDP?

The result of the analysis allows you to highlight priorities. And based on the main factors, predict, plan the development of priority areas, and make management decisions.

Regression happens:

  • linear (y = a + bx);
  • parabolic (y = a + bx + cx 2);
  • exponential (y = a * exp(bx));
  • power (y = a*x^b);
  • hyperbolic (y = b/x + a);
  • logarithmic (y = b * ln(x) + a);
  • exponential (y = a * b^x).

Let's look at an example of building a regression model in Excel and interpreting the results. Let's take the linear type of regression.

Task. At 6 enterprises, the average monthly salary and the number of quitting employees were analyzed. It is necessary to determine the dependence of the number of quitting employees on the average salary.

The linear regression model looks like this:

Y = a 0 + a 1 x 1 + … + a k x k ,

where a are the regression coefficients, x are the influencing variables, and k is the number of factors.

In our example, Y is the indicator of quitting employees. The influencing factor is wages (x).

Excel has built-in functions that can be used to calculate the parameters of a linear regression model. But the Analysis ToolPak add-in (the “Analysis Package”) will do it faster.

We activate a powerful analytical tool:

Once activated, the add-on will be available in the Data tab.

Now let's do the regression analysis itself.



First of all, we pay attention to R-squared and coefficients.

R-squared is the coefficient of determination. In our example – 0.755, or 75.5%. This means that the calculated parameters of the model explain 75.5% of the relationship between the studied parameters. The higher the coefficient of determination, the better the model. Good - above 0.8. Bad – less than 0.5 (such an analysis can hardly be considered reasonable). In our example – “not bad”.

The coefficient 64.1428 shows what Y will be if all variables in the model under consideration are equal to 0. That is, the value of the analyzed parameter is also influenced by other factors not described in the model.

The coefficient -0.16285 shows the weight of variable X on Y. That is, the average monthly salary within this model affects the number of quitters with a weight of -0.16285 (this is a small degree of influence). The “-” sign indicates a negative impact: the higher the salary, the fewer people quit. Which is fair.
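The worksheet data behind this example are not reproduced in the article, but the same three outputs, intercept, slope and R², can be computed for any two-column data set; a sketch on hypothetical salary/quit figures (all numbers below are illustrative, not the article's):

```python
def linreg(x, y):
    """Return intercept, slope and R^2 for simple linear regression."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    slope = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    intercept = sy / n - slope * sx / n
    r = (n * sxy - sx * sy) / ((n * sxx - sx ** 2) * (n * syy - sy ** 2)) ** 0.5
    return intercept, slope, r ** 2

# Hypothetical data: average monthly salary vs. number of employees who quit
salary = [250, 300, 350, 400, 450, 500]
quit_count = [28, 25, 21, 20, 16, 14]
a0, a1, r2 = linreg(salary, quit_count)
```

As in the article's output, a negative slope here means the higher the salary, the fewer people quit, and R² measures how much of that variation the model explains.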



Correlation Analysis in Excel

Correlation analysis helps determine whether there is a relationship between indicators in one or two samples. For example, between the operating time of a machine and the cost of repairs, the price of equipment and the duration of operation, the height and weight of children, etc.

If there is a connection, then does an increase in one parameter lead to an increase (positive correlation) or a decrease (negative) of the other. Correlation analysis helps the analyst determine whether the value of one indicator can be used to predict the possible value of another.

The correlation coefficient is denoted by r. Varies from +1 to -1. The classification of correlations for different areas will be different. When the coefficient is 0, there is no linear relationship between samples.

Let's look at how to find the correlation coefficient using Excel.

To find paired coefficients, the CORREL function is used.

Objective: Determine whether there is a relationship between the operating time of a lathe and the cost of its maintenance.

Place the cursor in any cell and press the fx button.

  1. In the “Statistical” category, select the CORREL function.
  2. Argument “Array 1” - the first range of values ​​– machine operating time: A2:A14.
  3. Argument “Array 2” - second range of values ​​– repair cost: B2:B14. Click OK.

To determine the type of connection, you need to look at the absolute value of the coefficient (each field of activity has its own scale).

For correlation analysis of several parameters (more than two), it is more convenient to use “Data Analysis” (the Analysis ToolPak add-in): select “Correlation” from the list and specify the data array. That's all.

The resulting coefficients will be displayed in the correlation matrix. Like this:
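Outside Excel, the same pairwise matrix can be sketched in Python by applying the Pearson formula to every pair of columns (the first two columns echo the price/production indices from the task above; the third column is made up for illustration):

```python
from math import sqrt

def correl(xs, ys):
    """Pearson correlation coefficient for two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / sqrt(sum((x - mx) ** 2 for x in xs) *
                      sum((y - my) ** 2 for y in ys))

def corr_matrix(columns):
    """Full symmetric matrix of pairwise coefficients, as in the add-in output."""
    return [[round(correl(a, b), 3) for b in columns] for a in columns]

cols = [
    [100, 105, 108, 113, 118, 120, 124],  # price index
    [70, 79, 85, 84, 85, 99, 102],        # production index
    [5.1, 4.8, 4.6, 4.7, 4.2, 4.0, 3.8],  # hypothetical third indicator
]
for row in corr_matrix(cols):
    print(row)
```

The diagonal is always 1.0 (each variable correlates perfectly with itself), and the matrix is symmetric, just as in the Excel output.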

Correlation and regression analysis

In practice, these two techniques are often used together.

Example:


Now the regression analysis data has become visible.

Let's consider a paired linear regression model of the relationship between two variables, for which the regression function φ(x) is linear. Denote by y_x the conditional average of the characteristic Y in the population at a fixed value x of the variable X. Then the regression equation takes the form:

y_x = ax + b, where a is the regression coefficient (the slope of the linear regression line). The regression coefficient shows by how many units the variable Y changes on average when the variable X changes by one unit. Using the least squares method, formulas are obtained for calculating the linear regression parameters:

Table 1. Formulas for calculating linear regression parameters

Free term b: b = ȳ − a·x̄

Regression coefficient a: a = (x̄y − x̄·ȳ) / (x̄² − (x̄)²), where x̄y is the mean of the products xy and x̄² is the mean of the squares x²

Determination coefficient: R² = 1 − Σ(yᵢ − ŷᵢ)² / Σ(yᵢ − ȳ)²

Testing the hypothesis about the significance of the regression equation:

H₀: a = 0 (the equation is not significant)

H₁: a ≠ 0

F = R² / (1 − R²) · (n − p − 1) / p, compared with the critical value F_cr(α; p; n − p − 1) from Appendix 7 (for linear regression p = 1)
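The least-squares formulas above can be checked numerically; a minimal sketch in pure Python, with illustrative data chosen so the answer is obvious:

```python
def linear_regression(xs, ys):
    """Least-squares estimates for y_x = a*x + b:
    a = (mean(xy) - mean(x)*mean(y)) / (mean(x^2) - mean(x)^2),
    b = mean(y) - a*mean(x)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    xy_mean = sum(x * y for x, y in zip(xs, ys)) / n
    x2_mean = sum(x * x for x in xs) / n
    a = (xy_mean - x_mean * y_mean) / (x2_mean - x_mean ** 2)
    b = y_mean - a * x_mean
    return a, b

# Points lying exactly on y = 2x + 1 recover a = 2, b = 1.
a, b = linear_regression([0, 1, 2, 3], [1, 3, 5, 7])
print(a, b)  # → 2.0 1.0
```

For noisy data the same formulas return the best-fitting line in the least-squares sense.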

The direction of the relationship between variables is determined based on the sign of the regression coefficient. If the sign of the regression coefficient is positive, the relationship between the dependent variable and the independent variable will be positive. If the sign of the regression coefficient is negative, the relationship between the dependent variable and the independent variable is negative (inverse).

To analyze the overall quality of the regression equation, the coefficient of determination R² is used, also called the square of the multiple correlation coefficient. The coefficient of determination (a measure of certainty) always lies in the interval [0; 1]. If the value of R² is close to one, the constructed model explains almost all of the variability in the corresponding variables. Conversely, a value of R² close to zero means the constructed model is of poor quality.

The determination coefficient R² shows what percentage of the relationship between the original values of Y and X is described by the found regression function. Fig. 3 shows the variation explained by the regression model and the total variation. Accordingly, the value 1 − R² shows what percentage of the variation of the parameter Y is due to factors not included in the regression model.
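The split into explained and residual variation can be illustrated directly; a sketch with hypothetical data scattered around the line y = 2x:

```python
def r_squared(xs, ys, a, b):
    """R^2 = 1 - SS_res / SS_tot: the share of the variation of Y
    explained by the linear model y = a*x + b."""
    y_mean = sum(ys) / len(ys)
    ss_res = sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))  # unexplained variation
    ss_tot = sum((y - y_mean) ** 2 for y in ys)                   # total variation
    return 1 - ss_res / ss_tot

xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]   # hypothetical observations near y = 2x
r2 = r_squared(xs, ys, a=2.0, b=0.0)
print(round(r2, 4))      # close to 1: the line explains almost all variation
print(round(1 - r2, 4))  # share attributable to factors outside the model
```

The two printed values always sum to 1, mirroring the decomposition of total variation described above.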

With a high value of the coefficient of determination (greater than 75%), a forecast can be made for a specific value within the range of the initial data. When predicting values outside that range, the validity of the resulting model cannot be guaranteed, because new factors that the model does not take into account may come into play.

The significance of the regression equation is assessed using the Fisher criterion (see Table 1). Provided that the null hypothesis is true, the criterion has a Fisher distribution with k₁ = p and k₂ = n − p − 1 degrees of freedom (for paired linear regression p = 1). If the null hypothesis is rejected, the regression equation is considered statistically significant. If the null hypothesis is not rejected, the regression equation is considered statistically insignificant or unreliable.
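The Fisher criterion reduces to a one-line computation; in the sketch below the sample size and the critical value are assumed for illustration (the critical value would normally be looked up in a table such as Appendix 7):

```python
def fisher_f(r2, n, p=1):
    """Observed F = R^2 / (1 - R^2) * (n - p - 1) / p for a regression with p factors."""
    return (r2 / (1 - r2)) * (n - p - 1) / p

# Assumed example: R^2 = 0.92, n = 10 observations, paired regression (p = 1).
f = fisher_f(0.92, 10)
f_critical = 5.32  # F(alpha = 0.05; k1 = 1; k2 = 8) from a standard table
print(round(f, 1), f > f_critical)  # F far exceeds F_cr: H0 is rejected
```

Since the observed F exceeds the critical value, the null hypothesis is rejected and the equation would be declared statistically significant.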

Example 1. In the machine shop, the structure of product costs and the share of purchased components are analyzed. It was noted that the cost of components depends on the time of their delivery. Distance traveled was selected as the most important factor affecting delivery time. Conduct regression analysis of supply data:

Distance, miles

Time, min

To perform regression analysis:

    construct a graph of the initial data, approximately determine the nature of the dependence;

    choose the type of regression function and determine the numerical coefficients of the model using the least squares method and the direction of the relationship;

    evaluate the strength of regression dependence using the coefficient of determination;

    evaluate the significance of the regression equation;

    make a forecast (or a conclusion about the impossibility of forecasting) using the adopted model for a distance of 2 miles.

2. Calculate the sums needed for the coefficients of the linear regression equation and the coefficient of determination R²: Σx, Σy, Σxy and Σx².

The required regression dependence has the form ŷ = 2.6597x + 5.9135. We determine the direction of the relationship between the variables: the sign of the regression coefficient is positive; therefore, the relationship is also positive, which confirms the graphical assumption.

3. Let's calculate the coefficient of determination: R² = 0.9183, or 92%. Thus, the linear model explains 92% of the variation in delivery time, which means that the factor (distance) was chosen correctly. The remaining 8% of the time variation is unexplained and is due to other factors that influence delivery time but are not included in the linear regression model.

4. Let's check the significance of the regression equation: since the calculated value of the Fisher criterion F exceeds the critical value F_cr at significance level α = 0.05, the regression equation (linear model) is statistically significant.

5. Let's solve the forecasting problem. Since the coefficient of determination R² has a high enough value and the 2 mile distance for which the prediction is to be made is within the range of the input data, the prediction can be made: ŷ(2) = 2.6597 · 2 + 5.9135 ≈ 11.23 min.
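The point forecast in step 5 is just a substitution into the fitted equation from Example 1:

```python
a, b = 2.6597, 5.9135  # coefficients of the fitted line from Example 1

def predict(x):
    """Point forecast of delivery time (min) for a given distance (miles)."""
    return a * x + b

print(round(predict(2), 2))  # → 11.23
```

As noted above, this forecast is trustworthy only because x = 2 lies inside the range of the original data and R² is high.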

Regression analysis can be conveniently carried out using the capabilities of Excel. The “Regression” mode is used to calculate the parameters of the linear regression equation and check its adequacy for the process under study. In the dialog box, fill in the following parameters:

Example 2. Complete the task of Example 1 using the “Regression” mode of Excel.

SUMMARY OUTPUT

Regression statistics

Multiple R

R Square

Adjusted R Square

Standard Error

Observations

Coefficients

Standard Error

t Stat

P-value

Intercept

Variable X 1

Let's look at the results of the regression analysis presented in the table.

The value of R Square, also called the measure of certainty, characterizes the quality of the resulting regression line. This quality is expressed by the degree of correspondence between the source data and the regression model (calculated data). In our example, the measure of certainty is 0.91829, which indicates a very good fit of the regression line to the original data and coincides with the coefficient of determination R² calculated by the formula.

Multiple R, the multiple correlation coefficient R, expresses the degree of dependence between the independent variable (X) and the dependent variable (Y) and is equal to the square root of the coefficient of determination. In simple linear regression analysis the Multiple R coefficient equals the linear correlation coefficient (r = 0.958).

Linear model coefficients: Intercept gives the value of the free term b, and Variable X 1 gives the regression coefficient a. Then the linear regression equation is:

y = 2.6597x + 5.9135 (which agrees well with the calculation results in Example 1).

Next, let's check the significance of the regression coefficients a and b. Comparing the values in the Coefficients and Standard Error columns of the table, we see that the absolute values of the coefficients are greater than their standard errors. In addition, these coefficients are significant, as can be judged by the values of the P-value indicator, which are less than the specified significance level α = 0.05.

Observation

Predicted Y

Residuals

Standard Residuals

The table shows the residual output. Using this part of the report, we can see the deviation of each point from the constructed regression line. The largest residual in absolute value in this case is 1.89256, the smallest is 0.05399. To interpret these data better, plot the original data and the constructed regression line. As the plot shows, the regression line fits the values of the initial data well, and the deviations are random.