Draw the line through the middle of a cloud of data points that is the "best fit" to the data.
This explanation treats regression solely as a descriptive statistic: which line lies "closest" to a given set of points? Here "closest" means minimizing the sum of the squared vertical (y) distances of the points from the line; the line that does so is the least squares regression line. I won't derive the formula, merely present it and then use it. The data are given as a set of points in the plane, i.e., as ordered pairs of x and y values.
X-bar, written as an X with a line over it, is the mean (average) of the x-values.
Y-bar, a Y with a line over it, is the mean of the y-values.
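SSxx is the sum of the squares of the x-deviations: SUM (xi-(X-bar))²
SSyy is the sum of the squares of the y-deviations: SUM (yi-(Y-bar))²
SSxy is the sum of the products of the x- and y-deviations: SUM (xi-(X-bar))(yi-(Y-bar))
The slope of the line is b1 = SSxy/SSxx
The y-intercept of the line is b0 = (Y-bar) - b1(X-bar)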
The least squares regression line is y-hat = b0 + b1x
(y-hat is written as a y with a circumflex over it.)
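The formulas above translate almost line for line into code. The following is a minimal sketch, not a definitive implementation; the function name `least_squares_line` and the three check points are assumptions made for illustration and do not come from this page.

```python
def least_squares_line(xs, ys):
    """Return (b0, b1) for the least squares line y-hat = b0 + b1*x."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")

    x_bar = sum(xs) / n  # X-bar: mean of the x-values
    y_bar = sum(ys) / n  # Y-bar: mean of the y-values

    # SSxx: sum of the squared x-deviations
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    # SSxy: sum of the products of the x- and y-deviations
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

    if ss_xx == 0:
        raise ValueError("all x-values are equal; the slope is undefined")

    b1 = ss_xy / ss_xx       # slope
    b0 = y_bar - b1 * x_bar  # y-intercept
    return b0, b1


# Hypothetical check data: three points lying exactly on y = 1 + 2x.
b0, b1 = least_squares_line([0, 1, 2], [1, 3, 5])
print(b0, b1)  # prints 1.0 2.0
```

Note that SSyy is not needed to find the line, so the sketch does not compute it.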
Data Values

| x | y |
|---|---|
| 2 | -5 |
| 4 | 14 |
| 9 | -1 |
| 13 | 38 |
| 16 | 11 |
The formula for the least squares regression line is
y-hat = b0 + b1x
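So in our example, where X-bar = 8.8, Y-bar = 11.4, SSxx = 138.8, and SSxy = 205.4, we get b1 = 205.4/138.8 ≈ 1.48 and b0 = 11.4 - (205.4/138.8)(8.8) ≈ -1.622, and the least squares regression line is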
y-hat = -1.622 + (1.48)x
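The arithmetic for this example can be reproduced with a short script. This is a sketch for checking the numbers, not part of the original explanation; the helper name `sum_squared_errors` and the sizes of the nudges are assumptions chosen for illustration. It also spot-checks the "closest" claim by confirming that a few nearby lines have a larger sum of squared vertical distances.

```python
# Data values from the table above.
xs = [2, 4, 9, 13, 16]
ys = [-5, 14, -1, 38, 11]
n = len(xs)

x_bar = sum(xs) / n  # about 8.8
y_bar = sum(ys) / n  # about 11.4
ss_xx = sum((x - x_bar) ** 2 for x in xs)                       # about 138.8
ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))  # about 205.4

b1 = ss_xy / ss_xx       # slope
b0 = y_bar - b1 * x_bar  # y-intercept


def sum_squared_errors(intercept, slope):
    """Sum of squared vertical distances from the points to a line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


best = sum_squared_errors(b0, b1)

# Nudging the intercept or the slope in either direction should only
# make the fit worse. This is a spot check, not a proof.
for d0, d1 in [(0.5, 0.0), (-0.5, 0.0), (0.0, 0.1), (0.0, -0.1)]:
    assert sum_squared_errors(b0 + d0, b1 + d1) > best

print(round(b0, 3), round(b1, 2))  # prints -1.622 1.48
```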
The webmaster and author of this Math Help site is Graeme McRae.