# Regression

Sometimes in statistics we need to compare two sets of data from the same source. We can do this by means of a **scatter diagram**. To plot a scatter diagram for two sets of data **x** and **y**, we plot each pair of corresponding values as a point. Data like this is called **bivariate data**.

**Example:**

The following table gives the test results for the first 2 tests on S1, probability and discrete random variables, for 12 sixth form students...

| Student | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Prob (x) | 65 | 88 | 83 | 92 | 50 | 67 | 100 | 100 | 73 | 90 | 83 | 94 |
| D.R.V (y) | 52 | 57 | 78 | 76 | 30 | 67 | 96 | 74 | 65 | 87 | 78 | 89 |

To plot the scatter diagram, we plot each student's probability score on the x-axis against their discrete random variables score on the y-axis.

So we plot (65, 52), (88, 57) and so on for all 12 students.
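As a small illustration of the pairing step, the sketch below builds the 12 points from the two rows of the table. The names `prob`, `drv` and `points` are illustrative only, and the actual drawing is left to whatever plotting tool (or graph paper) you prefer.

```python
# Test marks from the table above: x = probability, y = discrete random variables.
prob = [65, 88, 83, 92, 50, 67, 100, 100, 73, 90, 83, 94]
drv = [52, 57, 78, 76, 30, 67, 96, 74, 65, 87, 78, 89]

# Pair each student's two marks into one (x, y) point for the scatter diagram.
points = list(zip(prob, drv))

for student, (x, y) in enumerate(points, start=1):
    print(f"Student {student}: plot ({x}, {y})")
```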

As you can see from the diagram, there appears to be a trend in the scatter. The points seem to lie along the same diagonal line, which is called the 'line of best fit'. There are, of course, some obvious exceptions which lie a little too far from the line (these are ringed on the scatter diagram).

The values are given by

**Remember:** ∑ means 'sum of'.

In this example,

If all (or nearly all) of these points seem to lie in a straight line then there is said to be a **linear correlation** between x and y. (Correlation is like a link or connection.)

**There are 3 types of scatter we can obtain:**

If our scatter diagram shows a correlation between the 2 sets of data then we can add a **line of regression**. These are essentially 'lines of best fit' (as seen above), but they are calculated rather than drawn by eye. Unlike at GCSE, if there is a fair degree of scatter we often draw or calculate 2 regression lines.

These lines are:

**Regression line y on x**

Used to estimate y, taking x to be accurate. This is the line that gives the least sum of the squares of the vertical distances from the points to the line. Let's look at the following diagram to explain this...

The vertical distance from each point to the line is squared, and the squares are added together. The line that has the least total is the **regression line y on x**.
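The idea of "least total" can be checked numerically. The sketch below uses the test marks from the earlier table and assumes the standard closed-form least-squares coefficients. It compares the total squared vertical distance for that line with two slightly different lines. The function name `vertical_sse` and the size of the nudges are illustrative choices.

```python
xs = [65, 88, 83, 92, 50, 67, 100, 100, 73, 90, 83, 94]
ys = [52, 57, 78, 76, 30, 67, 96, 74, 65, 87, 78, 89]

def vertical_sse(a, b):
    """Sum of squared vertical distances from each point to the line y = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Closed-form least-squares coefficients for y on x (assumed standard result).
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
     / sum((x - mean_x) ** 2 for x in xs))
a = mean_y - b * mean_x

best = vertical_sse(a, b)            # total for the regression line y on x
steeper = vertical_sse(a, b + 0.05)  # same intercept, slightly steeper line
shifted = vertical_sse(a + 1.0, b)   # same gradient, moved up by one mark

print(best < steeper, best < shifted)
```

Any other line you try should give a larger total than `best`.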

**Regression line x on y**

Used to estimate x, taking y to be accurate. This is the line that gives the least sum of the squares of the horizontal distances from the points to the line. The following diagram will explain this further...

The horizontal distance from each point to the line is squared, and the squares are added together. The line that has the least total is the **regression line x on y**.
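To see that the two regression lines really are different lines, the sketch below measures horizontal distances for both of them, using the test marks from the earlier table. It assumes the standard closed-form least-squares coefficients, and all names (`horizontal_sse`, `b_xy`, `b_yx` and so on) are illustrative.

```python
xs = [65, 88, 83, 92, 50, 67, 100, 100, 73, 90, 83, 94]
ys = [52, 57, 78, 76, 30, 67, 96, 74, 65, 87, 78, 89]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
s_xx = sum((x - mean_x) ** 2 for x in xs)
s_yy = sum((y - mean_y) ** 2 for y in ys)

def horizontal_sse(a_prime, b_prime):
    """Sum of squared horizontal distances from each point to x = a' + b'*y."""
    return sum((x - (a_prime + b_prime * y)) ** 2 for x, y in zip(xs, ys))

# Regression line x on y, in the form x = a_xy + b_xy * y.
b_xy = s_xy / s_yy
a_xy = mean_x - b_xy * mean_y

# Regression line y on x (y = a_yx + b_yx * x), rearranged to x = (y - a_yx) / b_yx.
b_yx = s_xy / s_xx
a_yx = mean_y - b_yx * mean_x

sse_x_on_y = horizontal_sse(a_xy, b_xy)
sse_y_on_x = horizontal_sse(-a_yx / b_yx, 1 / b_yx)

print(sse_x_on_y < sse_y_on_x)
```

The x on y line gives the smaller horizontal total, which is why it is the one to use when estimating x from y.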

**Regression line y on x**

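This line has the equation:

$$y = a + bx$$

Not surprisingly, this is the equation of a straight line!

Here a and b are calculated using the following formulae:

$$b = \frac{S_{xy}}{S_{xx}}, \qquad a = \bar{y} - b\bar{x}$$

where

$$S_{xy} = \sum xy - \frac{\sum x \sum y}{n}, \qquad S_{xx} = \sum x^{2} - \frac{\left(\sum x\right)^{2}}{n}, \qquad \bar{x} = \frac{\sum x}{n}, \qquad \bar{y} = \frac{\sum y}{n}$$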

**Regression line x on y**

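This line has the equation:

$$x = a' + b'y$$

Here a′ and b′ are calculated using these formulae:

$$b' = \frac{S_{xy}}{S_{yy}}, \qquad a' = \bar{x} - b'\bar{y}$$

where

$$S_{xy} = \sum xy - \frac{\sum x \sum y}{n}, \qquad S_{yy} = \sum y^{2} - \frac{\left(\sum y\right)^{2}}{n}$$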

Note the similarities between S_{xy}, S_{xx} and S_{yy} in these formulae. The formulae may look tricky but in actual fact are quite easy and straightforward to use. The following example will demonstrate this.

**Example:**

We will use the data seen earlier of the test results for the first 2 tests on S1, probability and discrete random variables, for 12 sixth form students. We will then calculate the regression lines x on y, and y on x.

| Student | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Prob (x) | 65 | 88 | 83 | 92 | 50 | 67 | 100 | 100 | 73 | 90 | 83 | 94 |
| D.R.V (y) | 52 | 57 | 78 | 76 | 30 | 67 | 96 | 74 | 65 | 87 | 78 | 89 |

To use the formulae above we need to first calculate:

Σx, Σy, Σx^{2}, Σy^{2} and Σxy

This can be done by adding extra rows to our table:

| Student | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| Prob (x) | 65 | 88 | 83 | 92 | 50 | 67 |
| D.R.V (y) | 52 | 57 | 78 | 76 | 30 | 67 |
| x^{2} | 4225 | 7744 | 6889 | 8464 | 2500 | 4489 |
| y^{2} | 2704 | 3249 | 6084 | 5776 | 900 | 4489 |
| xy | 3380 | 5016 | 6474 | 6992 | 1500 | 4489 |

| Student | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|
| Prob (x) | 100 | 100 | 73 | 90 | 83 | 94 |
| D.R.V (y) | 96 | 74 | 65 | 87 | 78 | 89 |
| x^{2} | 10000 | 10000 | 5329 | 8100 | 6889 | 8836 |
| y^{2} | 9216 | 5476 | 4225 | 7569 | 6084 | 7921 |
| xy | 9600 | 7400 | 4745 | 7830 | 6474 | 8366 |

Summing each row we obtain the following data:

n = 12, Σx = 985, Σy = 849, Σx^{2} = 83465, Σy^{2} = 63693, Σxy = 72266
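The same totals can be checked with a few lines of Python. This is only a sketch of the row-summing step, and the variable names are illustrative.

```python
prob = [65, 88, 83, 92, 50, 67, 100, 100, 73, 90, 83, 94]  # x
drv = [52, 57, 78, 76, 30, 67, 96, 74, 65, 87, 78, 89]     # y

n = len(prob)
sum_x = sum(prob)
sum_y = sum(drv)
sum_x2 = sum(x * x for x in prob)               # sum of the x^2 row
sum_y2 = sum(y * y for y in drv)                # sum of the y^2 row
sum_xy = sum(x * y for x, y in zip(prob, drv))  # sum of the xy row

print(n, sum_x, sum_y, sum_x2, sum_y2, sum_xy)
# prints: 12 985 849 83465 63693 72266
```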

**Regression line y on x**

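Let's start with b.

$$S_{xy} = \sum xy - \frac{\sum x \sum y}{n} = 72266 - \frac{985 \times 849}{12} = 2577.25$$

$$S_{xx} = \sum x^{2} - \frac{\left(\sum x\right)^{2}}{n} = 83465 - \frac{985^{2}}{12} \approx 2612.92$$

$$b = \frac{S_{xy}}{S_{xx}} = \frac{2577.25}{2612.92} \approx 0.986$$

Now we have **b**, let's find **a**, where

$$a = \bar{y} - b\bar{x} = \frac{849}{12} - 0.986 \times \frac{985}{12}$$

So, a = -10.2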

**The line of regression y on x is y = a + bx**

So y = 0.986x - 10.2
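As a check on the arithmetic, here is a minimal sketch that works from the summarised data only. It assumes the usual definitions S_xy = Σxy − ΣxΣy/n and S_xx = Σx^{2} − (Σx)^{2}/n, and the variable names are illustrative.

```python
# Summarised data from the worked example.
n, sum_x, sum_y = 12, 985, 849
sum_x2, sum_xy = 83465, 72266

s_xy = sum_xy - sum_x * sum_y / n  # 2577.25
s_xx = sum_x2 - sum_x ** 2 / n     # about 2612.92

b = s_xy / s_xx                    # gradient of y on x
a = sum_y / n - b * sum_x / n      # intercept: mean of y minus b times mean of x

print(f"y = {a:.1f} + {b:.3f}x")
# prints: y = -10.2 + 0.986x
```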

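**Regression line x on y**

Let's start with b′.

$$S_{xy} = \sum xy - \frac{\sum x \sum y}{n} = 2577.25, \qquad S_{yy} = \sum y^{2} - \frac{\left(\sum y\right)^{2}}{n} = 63693 - \frac{849^{2}}{12} = 3626.25$$

$$b' = \frac{S_{xy}}{S_{yy}} = \frac{2577.25}{3626.25} \approx 0.711$$

$$a' = \bar{x} - b'\bar{y} = \frac{985}{12} - 0.711 \times \frac{849}{12} \approx 31.8$$

**The line of regression x on y is x = a′ + b′y**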

So, x = 31.8 + 0.711y

Let's suppose a student, Vikki, was absent for the 1st test but got 52 for the 2nd.

**Which line could we use to estimate her 1st test result?**

The line x on y estimates x (the 1st test) taking y (the 2nd test) to be accurate. This is the line to use, as we are given the 2nd test mark (accurate) and are asked to estimate the 1st.

Using x on y we have **x = 31.8 + 0.711y**

Hence x = 31.8 + 0.711 × 52

= 68.772 or approximately a mark of 69
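The whole x on y calculation, including the estimate for Vikki, can be sketched from the summarised data as follows. It assumes the usual definitions of S_xy and S_yy, and rounds the coefficients to the same accuracy as the worked example. The function name `estimate_first_test` is made up for illustration.

```python
# Summarised data from the worked example.
n, sum_x, sum_y = 12, 985, 849
sum_y2, sum_xy = 63693, 72266

s_xy = sum_xy - sum_x * sum_y / n  # 2577.25
s_yy = sum_y2 - sum_y ** 2 / n     # 3626.25

# Round as in the worked example: b' to 3 d.p., a' to 1 d.p.
b_prime = round(s_xy / s_yy, 3)
a_prime = round(sum_x / n - b_prime * sum_y / n, 1)

def estimate_first_test(second_test_mark):
    """Estimate x (1st test) from y (2nd test) using the x on y line."""
    return a_prime + b_prime * second_test_mark

vikki = estimate_first_test(52)
print(a_prime, b_prime, round(vikki))
```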

Let's suppose we had the data:

| x | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| y | 19 | 23 | 23 | 27 | 30 |

With the above data, **x** looks to be controlled, whereas **y** appears to depend on an experiment and on **x**. In this case, we say that **x** is an independent variable and **y** a dependent variable. As **x** is controlled and taken to be accurate, we only need to calculate the regression line **y** on **x**.
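For this small data set the y on x line can be worked out directly. The sketch below assumes the usual formulae b = S_xy / S_xx and a = ȳ − b x̄, and the function name `regression_y_on_x` is illustrative.

```python
def regression_y_on_x(xs, ys):
    """Return (a, b) for the least-squares line y = a + b*x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    s_xx = sum(x * x for x in xs) - sum_x ** 2 / n
    b = s_xy / s_xx
    a = sum_y / n - b * sum_x / n
    return a, b

a, b = regression_y_on_x([1, 2, 3, 4, 5], [19, 23, 23, 27, 30])
print(f"y = {a:.1f} + {b:.1f}x")
# prints: y = 16.6 + 2.6x
```

Here S_xy = 26 and S_xx = 10, so b = 2.6 and a = 24.4 − 2.6 × 3 = 16.6.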

In the example above we were given the **raw data**, from which we calculated our **summarised data**. We did this by adding extra rows to our table of raw data and then systematically working out each value of x^{2}, y^{2} and xy before summing each row.

This method is not only time consuming but incredibly boring!

The quickest way to deal with our raw data is to plug it straight into our calculators and let our calculators work out not only the summarised data, but also the values of **a** and **b**, or **a′** and **b′**.

Your calculator will need to be in **linear regression mode**. You will need to work out for yourselves (with help from your teachers) how to use this function, as most calculators use different keying sequences.

*Warning:* In A-level questions you could be given raw data or summarised data. This means you should know how to use your calculator for raw data, and also have a working knowledge of the formulae.

**Do not** just rely on the calculator method.