Regression

Sometimes in statistics we need to compare 2 sets of data from the same source. We can do this by means of a scatter diagram. To plot a scatter diagram for 2 sets of data x and y, we plot each pair of corresponding values as a point. Data like this is called bivariate data.

Example:

The following table gives the results of the first 2 tests on S1, probability and discrete random variables, for 12 sixth form students:

Student: 1 2 3 4 5 6 7 8 9 10 11 12
Prob (x): 65 88 83 92 50 67 100 100 73 90 83 94
D.R.V (y): 52 57 78 76 30 67 96 74 65 87 78 89

To plot the scatter diagram, we plot their probability score on the x-axis against their discrete random variable score on the y-axis.

So we plot (65, 52), (88, 57) and so on for all 12 students.

[Diagram: scatter diagram of the 12 students' scores]

As you can see from the diagram, there appears to be a trend in the scatter. The points seem to lie along the same diagonal line, which is called the 'line of best fit'. There are of course some obvious exceptions, which lie a little too far from the line (these are ringed on the scatter diagram).

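The line of best fit passes through the mean point (x̄, ȳ).

The values are given by

$$\bar{x} = \frac{\sum x}{n}, \qquad \bar{y} = \frac{\sum y}{n}$$

Remember: ∑ means 'sum of'.

In this example,

$$\bar{x} = \frac{985}{12} \approx 82.1, \qquad \bar{y} = \frac{849}{12} = 70.75$$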
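As a quick check, the mean point (x̄, ȳ) for the 12 students can be worked out in a few lines of Python. This is only a sketch: the variable names are illustrative, and the means are taken to be the usual 'sum of the values divided by n'.

```python
# Test scores for the 12 students, copied from the table above.
x = [65, 88, 83, 92, 50, 67, 100, 100, 73, 90, 83, 94]  # probability test
y = [52, 57, 78, 76, 30, 67, 96, 74, 65, 87, 78, 89]    # discrete random variables test

n = len(x)
x_bar = sum(x) / n  # mean of x = (sum of x) / n
y_bar = sum(y) / n  # mean of y = (sum of y) / n

print(round(x_bar, 2), round(y_bar, 2))  # 82.08 70.75
```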

If all (or nearly all) of these points seem to lie in a straight line then there is said to be a linear correlation between x and y. (Correlation is like a link or connection.)

There are 3 types of scatter we can obtain:

[Diagram: the 3 types of scatter]

If our scatter diagram shows a correlation between the 2 sets of data then we can add a line of regression. Regression lines are essentially 'lines of best fit' (as seen above), but they are calculated rather than drawn by eye. However, unlike at GCSE, if we have a fair degree of scatter we often draw or calculate 2 regression lines.

These lines are:

Regression line y on x

Used to estimate y, taking x to be accurate. This line is found by minimising the sum of the squares of the vertical distances from the points to the line. The following diagram illustrates this:

[Diagram: regression line y on x]

The vertical distance from each point to the line is squared, and the squares are added together. The line with the smallest total is the regression line y on x.

Regression line x on y

Used to estimate x, taking y to be accurate. This line is found by minimising the sum of the squares of the horizontal distances from the points to the line. The following diagram illustrates this:

[Diagram: regression line x on y]

The horizontal distance from each point to the line is squared, and the squares are added together. The line with the smallest total is the regression line x on y.
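The 'least sum of squares' idea can be tried out numerically. The sketch below uses the test data from earlier and the two regression lines that are worked out later in this section (coefficients rounded to 3 significant figures). The function names are made up for illustration. Each line is compared with the other line rearranged into the same form, to show that each one wins only on its own kind of distance.

```python
x = [65, 88, 83, 92, 50, 67, 100, 100, 73, 90, 83, 94]  # probability test
y = [52, 57, 78, 76, 30, 67, 96, 74, 65, 87, 78, 89]    # discrete random variables test

def sum_sq_vertical(a, b):
    """Sum of squared vertical distances from the points to y = a + b*x."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

def sum_sq_horizontal(a, b):
    """Sum of squared horizontal distances from the points to x = a + b*y."""
    return sum((xi - (a + b * yi)) ** 2 for xi, yi in zip(x, y))

# The two regression lines calculated later in this section (3 s.f.).
y_on_x = (-10.2, 0.986)   # y = -10.2 + 0.986x
x_on_y = (31.8, 0.711)    # x = 31.8 + 0.711y

# The same two lines rearranged into the other form, for comparison.
x_on_y_as_y = (-31.8 / 0.711, 1 / 0.711)  # y = (x - 31.8) / 0.711
y_on_x_as_x = (10.2 / 0.986, 1 / 0.986)   # x = (y + 10.2) / 0.986

# y on x gives the smaller total of squared vertical distances...
print(sum_sq_vertical(*y_on_x) < sum_sq_vertical(*x_on_y_as_y))      # True
# ...while x on y gives the smaller total of squared horizontal distances.
print(sum_sq_horizontal(*x_on_y) < sum_sq_horizontal(*y_on_x_as_x))  # True
```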

Regression line y on x

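This line has the equation:

$$y = a + bx$$

Not surprisingly, this is the equation of a straight line!

Here a and b are calculated using the following formulae:

$$b = \frac{S_{xy}}{S_{xx}}, \qquad a = \bar{y} - b\bar{x}$$

where

$$S_{xy} = \sum xy - \frac{\sum x \sum y}{n}, \qquad S_{xx} = \sum x^2 - \frac{(\sum x)^2}{n}$$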

Regression line x on y

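This line has the equation:

$$x = a' + b'y$$

Here a′ and b′ are calculated using these formulae:

$$b' = \frac{S_{xy}}{S_{yy}}, \qquad a' = \bar{x} - b'\bar{y}$$

where

$$S_{xy} = \sum xy - \frac{\sum x \sum y}{n}, \qquad S_{yy} = \sum y^2 - \frac{(\sum y)^2}{n}$$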

Note the similarities between Sxy, Sxx and Syy in these formulae. The formulae may look tricky but in actual fact are quite easy and straightforward to use. The following example will demonstrate this.
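To see how the formulae fit together, here is a minimal Python sketch that works from summarised data only. It assumes the standard definitions Sxy = Σxy − ΣxΣy/n, Sxx = Σx² − (Σx)²/n and Syy = Σy² − (Σy)²/n. The function name `regression_from_sums` is made up for illustration, and the totals passed in are the ones for the 12 students' test data used in the example that follows.

```python
def regression_from_sums(n, sum_x, sum_y, sum_x2, sum_y2, sum_xy):
    """Return ((a, b), (a_dash, b_dash)) for y = a + bx and x = a' + b'y."""
    # Assumed standard definitions of Sxy, Sxx and Syy.
    s_xy = sum_xy - sum_x * sum_y / n
    s_xx = sum_x2 - sum_x ** 2 / n
    s_yy = sum_y2 - sum_y ** 2 / n
    x_bar, y_bar = sum_x / n, sum_y / n

    b = s_xy / s_xx                 # gradient of y on x
    a = y_bar - b * x_bar           # intercept of y on x
    b_dash = s_xy / s_yy            # gradient of x on y
    a_dash = x_bar - b_dash * y_bar # intercept of x on y
    return (a, b), (a_dash, b_dash)

(a, b), (a_dash, b_dash) = regression_from_sums(12, 985, 849, 83465, 63693, 72266)
print(round(a, 1), round(b, 3))            # -10.2 0.986
print(round(a_dash, 1), round(b_dash, 3))  # 31.8 0.711
```

Notice that the only difference between the two gradients is the denominator: Sxx for y on x, Syy for x on y.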

Example:

We will use the data seen earlier of the test results for the first 2 tests on S1, probability and discrete random variables, for 12 sixth form students. We will then calculate the regression lines x on y, and y on x.

Student: 1 2 3 4 5 6 7 8 9 10 11 12
Prob (x): 65 88 83 92 50 67 100 100 73 90 83 94
D.R.V (y): 52 57 78 76 30 67 96 74 65 87 78 89

To use the formulae above we need to first calculate:

Σx, Σy, Σx², Σy² and Σxy

This can be done by adding extra rows to our table:

Student: 1 2 3 4 5 6
Prob (x): 65 88 83 92 50 67
D.R.V (y): 52 57 78 76 30 67
x²: 4225 7744 6889 8464 2500 4489
y²: 2704 3249 6084 5776 900 4489
xy: 3380 5016 6474 6992 1500 4489

Student: 7 8 9 10 11 12
Prob (x): 100 100 73 90 83 94
D.R.V (y): 96 74 65 87 78 89
x²: 10000 10000 5329 8100 6889 8836
y²: 9216 5476 4225 7569 6084 7921
xy: 9600 7400 4745 7830 6474 8366

Summing each row we obtain the following data:

n = 12, Σx = 985, Σy = 849
Σx² = 83465, Σy² = 63693, Σxy = 72266
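The totals above can be double-checked with a short Python sketch that builds the same extra rows from the raw data. The list names are illustrative only.

```python
x = [65, 88, 83, 92, 50, 67, 100, 100, 73, 90, 83, 94]  # probability test
y = [52, 57, 78, 76, 30, 67, 96, 74, 65, 87, 78, 89]    # discrete random variables test

# The three extra rows of the table.
x_sq = [xi ** 2 for xi in x]
y_sq = [yi ** 2 for yi in y]
xy = [xi * yi for xi, yi in zip(x, y)]

# Summing each row gives the summarised data.
print(len(x), sum(x), sum(y))        # 12 985 849
print(sum(x_sq), sum(y_sq), sum(xy)) # 83465 63693 72266
```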

Regression line y on x

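Let's start with b.

$$b = \frac{S_{xy}}{S_{xx}} = \frac{\sum xy - \frac{\sum x \sum y}{n}}{\sum x^2 - \frac{(\sum x)^2}{n}} = \frac{72266 - \frac{985 \times 849}{12}}{83465 - \frac{985^2}{12}} = \frac{2577.25}{2612.92} = 0.986 \text{ (3 s.f.)}$$

Now we have b, let's find a, where

$$a = \bar{y} - b\bar{x} = \frac{849}{12} - 0.986 \times \frac{985}{12}$$

So, a = -10.2 (3 s.f.)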

The line of regression y on x is y = a + bx

So y = 0.986x - 10.2

Regression line x on y

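Let's start with b′.

$$b' = \frac{S_{xy}}{S_{yy}} = \frac{\sum xy - \frac{\sum x \sum y}{n}}{\sum y^2 - \frac{(\sum y)^2}{n}} = \frac{72266 - \frac{985 \times 849}{12}}{63693 - \frac{849^2}{12}} = \frac{2577.25}{3626.25} = 0.711 \text{ (3 s.f.)}$$

Then a′ follows from

$$a' = \bar{x} - b'\bar{y} = \frac{985}{12} - 0.711 \times \frac{849}{12} = 31.8 \text{ (3 s.f.)}$$

The line of regression x on y is x = a′ + b′y

So, x = 31.8 + 0.711y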

Let's suppose a student Vikki was absent for the 1st test but got 52 for the 2nd.

Which of our regression lines could we use to estimate her 1st test result?

The line x on y is used to estimate x (1st test), taking y (2nd test) to be accurate. This is the line to use here, as we are given the 2nd test mark (accurate) and are asked to estimate the 1st.

Using x on y we have x = 31.8 + 0.711y

Hence x = 31.8 + 0.711 × 52

= 68.772 or approximately a mark of 69
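The same estimate can be reproduced in Python. The sketch below also shows, purely for comparison, what happens if the y on x line is rearranged and used instead: the answer comes out noticeably different, which is why choosing the right line matters. The function names are illustrative, and the coefficients are the rounded values from the worked example.

```python
# Regression lines from the worked example (coefficients rounded to 3 s.f.).
A_DASH, B_DASH = 31.8, 0.711   # x on y: x = 31.8 + 0.711y
A, B = -10.2, 0.986            # y on x: y = 0.986x - 10.2

def estimate_x_from_y(y_mark):
    """Estimate the 1st test mark from the 2nd, using the x on y line."""
    return A_DASH + B_DASH * y_mark

def rearranged_y_on_x(y_mark):
    """For comparison only: solve y = A + Bx for x (the wrong line to use here)."""
    return (y_mark - A) / B

print(round(estimate_x_from_y(52), 3))  # 68.772
print(round(rearranged_y_on_x(52), 1))  # 63.1
```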

Let's suppose we had the data:

x: 1 2 3 4 5
y: 19 23 23 27 30

With the above data, x appears to be controlled, whereas y appears to depend on x and on the outcome of an experiment. In this case, we say that x is an independent variable and y is a dependent variable. As x is controlled and accurate, we only need to calculate the regression line y on x.
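As an illustration, the regression line y on x for this small data set can be worked out as follows. This is a sketch that assumes the usual formulae b = Sxy/Sxx and a = ȳ − bx̄, and the variable names are illustrative.

```python
x = [1, 2, 3, 4, 5]       # controlled (independent) variable
y = [19, 23, 23, 27, 30]  # dependent variable

n = len(x)
sum_x, sum_y = sum(x), sum(y)
sum_x2 = sum(xi ** 2 for xi in x)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))

s_xy = sum_xy - sum_x * sum_y / n  # 392 - (15 * 122) / 5 = 26
s_xx = sum_x2 - sum_x ** 2 / n     # 55 - 15**2 / 5 = 10

b = s_xy / s_xx                    # gradient
a = sum_y / n - b * sum_x / n      # intercept = y_bar - b * x_bar

print(round(a, 1), round(b, 1))    # 16.6 2.6
```

So for this data the line y on x would be y = 16.6 + 2.6x.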

In the test results example above we were given the raw data, from which we calculated our summarised data. We did this by adding extra rows to our table and systematically working out each value of x², y² and xy before summing each row.

This method is not only time consuming but incredibly boring!

The quickest way to deal with our raw data is to enter it straight into our calculators and let them work out not only the summarised data, but also the values of a and b, or a′ and b′.

Your calculator will need to be in linear regression mode. You will need to work out for yourselves (with help from your teachers) how to use this function as most calculators use different keying sequences.

Warning: In A-level questions you could be given raw data or summarised data. This means you should know how to use your calculator for raw data, but also have a working knowledge of the formulae. Do not just rely on the calculator method.
