
Least squares method in Excel

The least squares method (LSM) is based on minimizing the sum of squared deviations of the chosen function from the data under study. In this article we approximate the available data with a linear function y = ax + b.

The least squares method (in English, Ordinary Least Squares, OLS) is one of the basic methods of regression analysis for estimating the unknown parameters of regression models from sample data.

Let's consider approximation by functions that depend only on one variable:

  • Linear: y=ax+b (this article)
  • Logarithmic: y=a*Ln(x)+b
  • Power: y=a*x^m
  • Exponential: y=a*EXP(b*x)+c
  • Quadratic: y=ax^2+bx+c

Note: Approximation by polynomials of the 3rd to the 6th degree and approximation by a trigonometric polynomial are covered in separate articles.

Linear dependence

We are interested in the relationship between two variables x and y. We assume that y depends on x according to the linear law y = ax + b. To determine the parameters of this relationship, the researcher made observations: for each value x_i, a measurement y_i was taken (see example file). Accordingly, let there be 20 pairs of values (x_i; y_i).

Note: If the step in x is constant, a Line chart can be used to build the scatter diagram; if not, you need to use the Scatter chart type.

It is obvious from the diagram that the relationship between the variables is close to linear. To understand which of the many straight lines most “correctly” describes the relationship between variables, it is necessary to determine the criterion by which the lines will be compared.

As such a criterion we use the expression:

$$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

where ŷ_i = a*x_i + b and n is the number of pairs of values (in our case n = 20).

The expression above is the sum of the squared distances between the observed values y_i and the fitted values ŷ_i, and is often denoted SSE (Sum of Squared Errors, or sum of squared residuals).

The least squares method consists in choosing the line ŷ = ax + b for which the expression above takes its minimum value.

Note: Any non-vertical line in two-dimensional space is uniquely determined by the values of two parameters: a (slope) and b (intercept).

It is generally assumed that the smaller the sum of squared distances, the better the line approximates the available data and the better it can be used to predict values of y from x. Note that even if in reality there is no relationship between the variables, or the relationship is nonlinear, OLS will still select a “best” line. Thus, the least squares method says nothing about whether a real relationship between the variables exists; it simply selects the parameters a and b for which the expression above is minimal.
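To make the comparison criterion concrete, here is a minimal Python sketch that scores a few candidate lines by their sum of squared errors. The data points and the candidate (a, b) pairs are made up for illustration and are not taken from the example file.

```python
def sse(a, b, xs, ys):
    """Sum of squared errors between observed ys and the line a*x + b."""
    return sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))

# Hypothetical observations that roughly follow y = 2x + 1
xs = [1, 2, 3, 4, 5]
ys = [3.1, 4.9, 7.2, 8.8, 11.1]

# Hypothetical candidate lines, given as (slope, intercept)
candidates = {
    "a=2, b=1": (2.0, 1.0),
    "a=1.5, b=3": (1.5, 3.0),
    "a=3, b=-2": (3.0, -2.0),
}

scores = {name: sse(a, b, xs, ys) for name, (a, b) in candidates.items()}
best = min(scores, key=scores.get)

for name, value in scores.items():
    print(f"{name}: SSE = {value:.3f}")
print("Smallest SSE:", best)
```

The candidate with the smallest SSE is only the best of the lines tried; the least squares method finds the minimum over all possible a and b.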

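With some fairly simple mathematics (setting the partial derivatives of the expression with respect to a and b to zero), you can calculate the parameters a and b:

$$a = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad b = \bar{y} - a\,\bar{x}$$

where x̄ and ȳ are the sample means of x and y. As the formula shows, the parameter a is the ratio of the covariance of x and y to the variance of x, so in MS EXCEL you can calculate a with the following formulas (see the Linear sheet of the example file):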

= COVAR(B26:B45;C26:C45)/VAR.P(B26:B45) or

= COVARIANCE.S(B26:B45;C26:C45)/VAR.S(B26:B45)

You can also calculate the parameter a with the formula = SLOPE(C26:C45;B26:B45). For the parameter b, use the formula = INTERCEPT(C26:C45;B26:B45).
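The covariance-over-variance calculation can also be checked outside Excel. The sketch below is an illustration in plain Python with made-up data; `fit_line` is a hypothetical helper name, not an Excel or library function.

```python
def fit_line(xs, ys):
    """Return (a, b) of the least squares line y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Population covariance and variance (both divide by n). Using the
    # sample versions (divide by n - 1) gives the same ratio.
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / n
    a = cov_xy / var_x          # slope
    b = mean_y - a * mean_x     # intercept
    return a, b

# Hypothetical observations
xs = [1, 2, 3, 4, 5]
ys = [3.1, 4.9, 7.2, 8.8, 11.1]

a, b = fit_line(xs, ys)
print(f"a = {a:.2f}, b = {b:.2f}")  # a = 1.99, b = 1.05
```

Because the divisor cancels in the ratio, the population pair and the sample pair of Excel functions return the same slope.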

Finally, the LINEST() function calculates both parameters at once. To enter the formula LINEST(C26:C45;B26:B45), select 2 adjacent cells in a row and press CTRL + SHIFT + ENTER (it is an array formula). The left cell returns a, the right cell returns b.

Note: To avoid entering array formulas, you can additionally use the INDEX() function. The formula = INDEX(LINEST(C26:C45;B26:B45);1), or simply = LINEST(C26:C45;B26:B45), returns the parameter responsible for the slope of the line, i.e. a. The formula = INDEX(LINEST(C26:C45;B26:B45);2) returns the parameter responsible for the intersection of the line with the Y axis, i.e. b.

Having calculated the parameters, you can draw the corresponding line on the scatter diagram.

Another way to plot a least squares line is the chart tool Trendline. To use it, select the chart, go to the Layout tab and, in the Analysis group, click Trendline, then Linear Trendline.

By checking the “Display equation on chart” box in the dialog box, you can verify that the parameters found above match the values on the chart.

Note: For the parameters to match, the chart type must be Scatter. The point is that in a Line chart the X-axis values cannot be specified by the user (the user can only specify labels, which do not affect the location of the points). Instead of the X values, the sequence 1; 2; 3; ... is used (to number the categories). Therefore, if you build a trendline on a Line chart, the values of this sequence will be used instead of the actual X values, which leads to an incorrect result (unless, of course, the actual X values coincide with the sequence 1; 2; 3; ...).
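The effect described in the note is easy to reproduce numerically. The following Python sketch fits the same y values twice: once against the real x values and once against the category positions 1; 2; 3; ... that a Line chart would use. The data are hypothetical and `slope_intercept` is an illustrative helper.

```python
def slope_intercept(xs, ys):
    """Least squares slope and intercept for y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    return a, mean_y - a * mean_x

# Hypothetical, unevenly spaced x values; y lies exactly on y = 2x + 5
xs = [10, 20, 40, 80]
ys = [25, 45, 85, 165]

true_fit = slope_intercept(xs, ys)

# A Line chart ignores xs and numbers the categories 1, 2, 3, ...
category_positions = list(range(1, len(ys) + 1))
line_chart_fit = slope_intercept(category_positions, ys)

print("fit against real x:      ", true_fit)        # slope 2, intercept 5
print("fit against 1; 2; 3; ...:", line_chart_fit)  # slope 46, intercept -35
```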

Well, at work we have reported to the inspectors, and the article for the conference has been written at home, so now I can write on the blog. While processing my data, I realized I had to write about a very cool and useful Excel add-in called Solver. So this article is devoted to that add-in, and I will explain it using the example of the least squares method (LSM) for finding the unknown coefficients of an equation that describes experimental data.

How to enable the Solver add-in

First, let's figure out how to enable this add-in.

1. Go to the “File” menu and select “Excel Options”.

2. In the window that appears, open “Add-ins”, select “Solver Add-in” and click “Go”.

3. In the next window, check the box next to “Solver Add-in” and click “OK”.

4. The add-in is now activated and can be found on the “Data” tab.

Least squares method

Now briefly about the least squares method (LSM) and where it can be used.

Let's say we have a data set from an experiment in which we studied the influence of a quantity X on a quantity Y.

We want to describe this influence mathematically, so that we can then use the formula to tell what value of Y we will get if we change X by a given amount.

I'll take a super-simple example (see figure).

It’s a no-brainer that the points lie one after another roughly along a straight line, so we can safely assume that our dependence is described by the linear function y=kx+b. At the same time, we are absolutely sure that when X is zero, Y is also zero. This means that the function describing the dependence is even simpler: y=kx (remember the school curriculum).

So we have to find the coefficient k. This is what we will do with LSM using the Solver add-in.

The idea of the method is that (attention: you need to think about this) the sum of the squares of the differences between the experimentally obtained values and the corresponding calculated values must be minimal. That is, when X1=1 the measured value is Y1=4.6 and the calculated value y1=f(x1) is 4, so the square of the difference is (y1-Y1)^2=(4-4.6)^2=0.36. Likewise for the next point: when X2=2 the measured value is Y2=8.1 and the calculated y2 is 8, so the square of the difference is (y2-Y2)^2=(8-8.1)^2=0.01. The sum of all these squares should be as small as possible.
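The arithmetic above can be reproduced in a few lines of Python. This sketch uses only the two measured pairs quoted in the text and the trial coefficient k = 4 implied by the calculated values 4 and 8.

```python
# The two (X, Y) pairs quoted in the text
measured = [(1, 4.6), (2, 8.1)]

# Trial coefficient of y = k*x implied by the calculated values 4 and 8
k = 4

# Squared difference between calculated and measured value for each point
squared_diffs = [(k * x - y) ** 2 for x, y in measured]
total = sum(squared_diffs)

print([round(d, 2) for d in squared_diffs])  # [0.36, 0.01]
print(round(total, 2))                       # 0.37
```

The results are rounded because binary floating point cannot represent 4.6 or 8.1 exactly.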

So, let's practise using LSM with the Excel Solver add-in.

Applying the Solver add-in

1. If you haven’t enabled the Solver add-in yet, go back to the section on enabling the add-in and turn it on 🙂

2. In cell A1, enter the value “1”. This one will be the first approximation to the real value of the coefficient k of our functional relationship y=kx.

3. Column B holds the values of X and column C the values of Y. In the cells of column D we enter the formula “coefficient k multiplied by the value of X”. For example, in cell D1 we enter “=A1*B1”, in cell D2 we enter “=A1*B2”, and so on.

4. We assume that the coefficient k equals one and that the function f(x)=y=1*x is the first approximation to our solution. We can calculate the sum of squared differences between the measured values of Y and those calculated with the formula y=1*x. We could do all this manually by typing the cell references into a formula: “=(D1-C1)^2+(D2-C2)^2+(D3-C3)^2...” and so on, only to make a mistake in the end and realize that we have wasted a lot of time. Excel has a special function for the sum of squared differences, SUMXMY2, which does everything for us. Enter it into cell A2 and give it the input data: the range of measured Y values (column C) and the range of calculated Y values (column D).

5. The sum of squared differences has been calculated. Now go to the “Data” tab and select “Solver”.

6. In the dialog that appears, select cell A1 (the one with the coefficient k) as the cell to be changed.

7. Select cell A2 as the objective and choose the “Min” option. Remember that this is the cell where we calculate the sum of squared differences between the calculated and measured values, and this sum must be minimal. Click “Solve”.

8. The coefficient k has been found. You can now verify that the calculated values are very close to the measured ones.
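The same search can be imitated in Python to see what the add-in is doing with cells A1 and A2. This is only a sketch under stated assumptions: the first two data points come from the text and the other three are invented, the search interval [0, 10] is an arbitrary bracket, and golden-section search is used as a simple stand-in rather than the algorithm Excel actually runs. For y=kx there is also a closed-form answer, k = Σxy / Σx², which serves as a check.

```python
import math

# First two pairs are from the text; the remaining three are hypothetical
xs = [1, 2, 3, 4, 5]
ys = [4.6, 8.1, 12.3, 15.8, 20.2]

def sse(k):
    """Sum of squared differences for y = k*x (the role of cell A2)."""
    return sum((k * x - y) ** 2 for x, y in zip(xs, ys))

def golden_section_min(f, lo, hi, tol=1e-9):
    """Minimize a one-variable unimodal function on the interval [lo, hi]."""
    inv_phi = (math.sqrt(5) - 1) / 2
    while hi - lo > tol:
        c = hi - inv_phi * (hi - lo)
        d = lo + inv_phi * (hi - lo)
        if f(c) < f(d):
            hi = d      # the minimum lies in [lo, d]
        else:
            lo = c      # the minimum lies in [c, hi]
    return (lo + hi) / 2

# Numerical search over k (the role of cell A1)
k_numeric = golden_section_min(sse, 0.0, 10.0)

# Closed-form least squares solution for a line through the origin
k_closed = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

print(f"numerical search: k = {k_numeric:.4f}")
print(f"closed form:      k = {k_closed:.4f}")
```

Both approaches should agree to several decimal places; the numerical route is the one that still works when the function is too exotic for a closed-form solution.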

P.S.

In general, of course, Excel has special tools for approximating experimental data that let you describe the data with linear, exponential, power and polynomial functions, so you can often do without the Solver add-in. I covered all of these approximation methods in an earlier article, so take a look if you are interested. But when it comes to some exotic function with one unknown coefficient, or to optimization problems, the add-in comes in very handy.

The Solver add-in can be used for other tasks as well; the main thing is to understand the idea: there is a cell whose value we adjust, and there is a target cell that holds the condition for choosing the unknown parameter.

The method of least squares (OLS) belongs to the field of regression analysis. It has many applications, as it allows an approximate representation of a given function by other, simpler ones. LSM can be extremely useful in processing observations, and it is actively used to estimate some quantities from the results of measurements of others that contain random errors. In this article, you will learn how to implement least squares calculations in Excel.

Statement of the problem using a specific example

Suppose there are two indicators X and Y. Moreover, Y depends on X. Since OLS interests us from the point of view of regression analysis (in Excel its methods are implemented using built-in functions), we should immediately move on to considering a specific problem.

So, let X be the retail space of a grocery store, measured in square meters, and Y be the annual turnover, measured in millions of rubles.

It is required to make a forecast of what turnover (Y) the store will have if it has this or that retail space. Obviously, the function Y = f (X) is increasing, since the hypermarket sells more goods than the stall.

A few words about the correctness of the initial data used for prediction

Let's say we have a table built using data for n stores.

From the standpoint of mathematical statistics, the results will be more or less reliable if data on at least 5-6 objects are examined. In addition, “anomalous” results should not be used. For example, a small elite boutique can have a turnover several times greater than that of large mass-market retail outlets.

The essence of the method

The table data can be plotted on a Cartesian plane as points M_1(x_1, y_1), ..., M_n(x_n, y_n). The problem then reduces to choosing an approximating function y = f(x) whose graph passes as close as possible to the points M_1, M_2, ..., M_n.

Of course, you can use a high-degree polynomial, but this option is not only difficult to implement, but also simply incorrect, since it will not reflect the main trend that needs to be detected. The most reasonable solution is to search for the straight line y = ax + b, which best approximates the experimental data, or more precisely, the coefficients a and b.

Accuracy assessment

With any approximation, assessing its accuracy is of particular importance. Let e_i denote the difference (deviation) between the experimental and functional values at the point x_i, i.e. e_i = y_i - f(x_i).

At first glance, to assess the accuracy of the approximation you could use the sum of the deviations: when choosing a straight line to represent the dependence of Y on X, you would prefer the one with the smallest sum of e_i over all points under consideration. However, it is not that simple, since positive deviations occur alongside negative ones and cancel each other out.

The issue can be solved by using the absolute values of the deviations or their squares. The latter method is the most widely used. It is applied in many areas, including regression analysis (implemented in Excel using two built-in functions), and has long proven its effectiveness.
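A small numerical illustration of why the plain sum of deviations is a poor criterion: in the Python sketch below (hypothetical data lying exactly on y = 2x), a clearly wrong horizontal line gets the same signed sum as the true line, while the absolute and squared sums tell them apart.

```python
# Hypothetical data lying exactly on the line y = 2x
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]

def deviations(a, b):
    """Deviations e_i = y_i - (a*x_i + b) for the line y = a*x + b."""
    return [y - (a * x + b) for x, y in zip(xs, ys)]

good = deviations(2, 0)   # the true line
bad = deviations(0, 6)    # horizontal line through the mean of y

for name, e in (("y = 2x", good), ("y = 6", bad)):
    print(name,
          "| sum:", sum(e),
          "| sum of absolute values:", sum(abs(v) for v in e),
          "| sum of squares:", sum(v * v for v in e))
# y = 2x | sum: 0 | sum of absolute values: 0 | sum of squares: 0
# y = 6 | sum: 0 | sum of absolute values: 12 | sum of squares: 40
```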

Least squares method

Excel, as you know, has a built-in AutoSum function that calculates the sum of all values in the selected range. Thus, nothing prevents us from calculating the value of the expression (e_1^2 + e_2^2 + e_3^2 + ... + e_n^2).

In mathematical notation this looks like:

$$\sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - f(x_i))^2$$

Since the decision was initially made to approximate using a straight line, we have:

$$\sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - (a x_i + b))^2$$

Thus, the task of finding the straight line that best describes the dependence between X and Y comes down to finding the minimum of a function of two variables:

$$F(a, b) = \sum_{i=1}^{n} (y_i - (a x_i + b))^2 \to \min$$

To do this, you need to set the partial derivatives with respect to the new variables a and b to zero and solve a simple system of two equations in two unknowns:

$$\frac{\partial F}{\partial a} = -2 \sum_{i=1}^{n} (y_i - (a x_i + b))\, x_i = 0, \qquad \frac{\partial F}{\partial b} = -2 \sum_{i=1}^{n} (y_i - (a x_i + b)) = 0$$

After some simple transformations, including division by 2 and manipulation of the sums, we get:

$$a \sum_{i=1}^{n} x_i^2 + b \sum_{i=1}^{n} x_i = \sum_{i=1}^{n} x_i y_i, \qquad a \sum_{i=1}^{n} x_i + n b = \sum_{i=1}^{n} y_i$$

Solving it, for example, by Cramer’s rule, we obtain a stationary point with certain coefficients a* and b*. This point is the minimum, i.e. to predict what turnover a store of a given area will have, we can use the straight line y = a*·x + b*, which is the regression model for the example in question. Of course, it will not give you the exact result, but it will help you get an idea of whether buying a store of a particular area on credit will pay off.
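As an illustration of this step, the Python sketch below builds the two-equation system from the sums over the data and solves it by Cramer's rule. The store areas and turnovers are invented for the example, and `fit_by_cramer` is a hypothetical helper name.

```python
def fit_by_cramer(xs, ys):
    """Solve the two normal equations for a and b by Cramer's rule.

    System:  a*sum(x^2) + b*sum(x) = sum(x*y)
             a*sum(x)   + b*n      = sum(y)
    """
    n = len(xs)
    sx = sum(xs)
    sy = sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))

    det = sxx * n - sx * sx
    if det == 0:
        raise ValueError("all x values are equal; the line is not determined")

    a = (sxy * n - sx * sy) / det      # first column replaced by right-hand side
    b = (sxx * sy - sx * sxy) / det    # second column replaced by right-hand side
    return a, b

# Hypothetical data: retail space (sq. m) and annual turnover (million rubles)
areas = [50, 100, 150, 200, 250]
turnover = [12, 19, 33, 41, 50]

a_star, b_star = fit_by_cramer(areas, turnover)
print(f"a* = {a_star:.3f}, b* = {b_star:.3f}")  # a* = 0.196, b* = 1.600
```

With these made-up numbers, the model predicts a turnover of about 0.196 * 300 + 1.6 = 60.4 for a 300 sq. m store.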

How to Implement Least Squares in Excel

Excel has a function for calculating values using least squares. It has the following form: TREND(known Y values; known X values; new X values; const). Let's apply it to our table.

To do this, type the “=” sign in the cell where the result of the least squares calculation should be displayed and select the “TREND” function. In the window that opens, fill in the appropriate fields, selecting:

  • the range of known Y values (in this case, the turnover data);
  • the range x_1, …, x_n, i.e. the retail space sizes;
  • the values of x, both known and unknown, for which you need to find out the turnover (for their location on the worksheet, see below).

In addition, the formula contains the logical argument “Const”. If you enter 1 (TRUE) in the corresponding field or leave it empty, the intercept b is calculated as usual; if you enter 0 (FALSE), the calculation is carried out assuming that b = 0.

If you need the forecast for more than one x value, then after entering the formula do not press “Enter”; instead, press the combination “Ctrl” + “Shift” + “Enter” on the keyboard.
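For readers who want to check TREND results outside Excel, here is a rough single-variable Python analogue. It is a sketch, not Excel's implementation: the function name `trend` and the data are hypothetical, and it follows the behaviour documented for TREND, where a false `const` forces the intercept b to zero.

```python
def trend(known_ys, known_xs, new_xs, const=True):
    """Predict y for each new x from a least squares line y = a*x + b.

    const=True  -> b is estimated from the data
    const=False -> b is forced to 0 (line through the origin)
    """
    n = len(known_xs)
    if const:
        mean_x = sum(known_xs) / n
        mean_y = sum(known_ys) / n
        sxy = sum((x - mean_x) * (y - mean_y)
                  for x, y in zip(known_xs, known_ys))
        sxx = sum((x - mean_x) ** 2 for x in known_xs)
        a = sxy / sxx
        b = mean_y - a * mean_x
    else:
        a = (sum(x * y for x, y in zip(known_xs, known_ys))
             / sum(x * x for x in known_xs))
        b = 0.0
    return [a * x + b for x in new_xs]

# Hypothetical data: retail space (sq. m) and annual turnover (million rubles)
areas = [50, 100, 150, 200, 250]
turnover = [12, 19, 33, 41, 50]

with_intercept = trend(turnover, areas, [300, 400])
through_origin = trend(turnover, areas, [300, 400], const=False)

print([round(v, 2) for v in with_intercept])   # [60.4, 80.0]
print([round(v, 2) for v in through_origin])   # [61.42, 81.89]
```

Returning a list for several new x values mirrors the array-formula entry described above.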

Some features

Regression analysis can be accessible even to dummies. The Excel function for predicting the values of an array of unknowns, TREND, can be used even by those who have never heard of least squares. It is enough to know some features of how it works. In particular:

  • If the range of known y values is arranged in a single column (row), then each column (row) of known x values is treated by the program as a separate variable.
  • If the range of known x values is not specified in the TREND window, the program treats it as the array of integers 1; 2; 3; …, with as many elements as the range of known y values.
  • To output an array of “predicted” values, the trend expression must be entered as an array formula.
  • If new x values are not specified, the TREND function takes them to be equal to the known x values. If the known x values are not specified either, the array 1; 2; 3; 4; …, of the same size as the range of known y values, is used for both.
  • The range with the new x values must have the same number of columns (or rows) as the range with the known x values, i.e. one for each independent variable.
  • The array of known x values can contain several variables. If there is only one variable, the ranges of known x and y values must have the same dimensions. If there are several variables, the range of known y values must fit in one column or one row.

FORECAST function

Regression analysis in Excel is implemented using several functions. One of them is called “FORECAST”. It is similar to “TREND”, i.e. it returns the result of a least squares calculation, but only for a single X whose Y value is unknown.

Now you know Excel formulas for dummies that let you predict the future value of an indicator according to a linear trend.

The least squares method is a mathematical procedure for constructing a linear equation that most closely fits a set of two series of numbers. The purpose of the method is to minimize the total squared error. Excel has tools that let you apply this method in calculations. Let's figure out how this is done.

The least squares method (LSM) gives a mathematical description of the dependence of one variable on another. It can be used for forecasting.

Enabling the Solver add-in

To use LSM in Excel, you need to enable the Solver add-in, which is disabled by default.

Once it is enabled, Solver is active in Excel and its tools appear on the ribbon.

Conditions of the problem

Let us describe the use of LSM with a specific example. We have two series of numbers, x and y, shown in the image below.

This dependence can be most accurately described by the function:

y = a + nx

At the same time, it is known that when x=0, y also equals 0. Therefore, the dependence can be described by the equation y=nx.

We have to find the minimum of the sum of squared differences.

Solution

Let's move on to a description of the direct application of the method.


As you can see, the application of the least squares method is a rather complex mathematical procedure. We showed it in action using a simple example, but there are much more complex cases. However, Microsoft Excel tools are designed to simplify the calculations as much as possible.
