The four big statistics tools

Statistics is used in almost every branch of science, yet it is almost unknown to anyone who is not a scientist or a mathematician.

The saddest part of the story is the large number of people trained in statistics who have no knowledge of the art.

As is the case with many other disciplines, the lack of knowledge results from making a simple thing complex--largely for ego reasons. Keeping the simple stuff secret confers great power.

It is possible to get four or five simple tools from the vast body of statistical junk which will provide most of the power needed for any research. Many people have gotten doctorates without knowing (or at least needing) much more than the four tools I present here.

The Mean

The most important single number is the mean. It condenses all the data into one number which represents it. If I have ten measurements, I obtain the mean by adding all the measurements together and dividing by ten (the number of measurements). If all the numbers were 5, adding 5 ten times gives 50 and dividing by 10 gives 5 (which is what we would expect). If the numbers vary all over the place, the mean gives the number which best represents the whole group. Use the function AVERAGE() in Excel.
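For anyone working outside Excel, here is a minimal Python sketch of the same arithmetic. The function name mean and the list named varied are made up for illustration; they are not from any particular library.

```python
def mean(values):
    """Add all the measurements together and divide by how many there are."""
    if not values:
        raise ValueError("the mean needs at least one value")
    return sum(values) / len(values)


# Ten measurements that are all 5: 50 divided by 10.
all_fives = [5] * 10
print(mean(all_fives))  # 5.0

# Ten made-up measurements that vary, but still add up to 50.
varied = [2, 9, 4, 7, 3, 8, 1, 6, 5, 5]
print(mean(varied))  # 5.0
```

Both lists give the same mean even though the second one is far more spread out, which is why the mean alone is not enough.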

The Standard Deviation

If all the numbers are 5, a mean of 5 is a perfect representation of the whole group. If they vary all over the lot, but the mean is 5, the representation is less than perfect. Subtracting the mean from one of the numbers gives us an idea of how much that number varies from the mean; this is the deviation.

To get an idea of how much all the numbers vary, we can get all the deviations and average them.

It is not possible to average all the deviations by simply adding them and dividing by the number of values, because some will be positive numbers and some will be negative. In fact, deviations from the mean always cancel exactly, so even a set of wild deviations would result in an average deviation of zero.
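A small sketch of the cancelling problem, using five made-up numbers (the variable names are my own):

```python
# Made-up values that vary a lot around their mean.
values = [1, 9, 2, 8, 5]
mean = sum(values) / len(values)

# Deviation of each value: the value minus the mean.
deviations = [v - mean for v in values]
print(deviations)  # [-4.0, 4.0, -3.0, 3.0, 0.0]

# The positive and negative deviations cancel when averaged directly.
average_deviation = sum(deviations) / len(deviations)
print(average_deviation)  # 0.0
```

The numbers range from 1 to 9, yet the plain average of the deviations reports no variation at all.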

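To get a true average, we square all the deviations to make them all positive, add them up, divide by the number of samples, and take the square root. Or just use STDEV() in Excel. (STDEV() divides by one less than the number of samples, which is the usual choice when the data is a sample rather than the whole population; STDEVP() divides by the number of samples, as described here. For large samples the difference is small.)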
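The recipe can be sketched in Python as follows. The function name population_sd and the data are made up for illustration; statistics.pstdev and statistics.stdev are the standard library's divide-by-n and divide-by-(n-1) versions.

```python
import math
import statistics


def population_sd(values):
    """Square the deviations, average them, and take the square root."""
    mean = sum(values) / len(values)
    squared_deviations = [(v - mean) ** 2 for v in values]
    return math.sqrt(sum(squared_deviations) / len(values))


values = [1, 9, 2, 8, 5]  # made-up data

print(population_sd(values))      # about 3.16 (square root of 50 / 5)
print(statistics.pstdev(values))  # same recipe from the standard library
print(statistics.stdev(values))   # about 3.54 (divides by n - 1 instead of n)
```

The two versions differ only in the divisor, so the gap between them shrinks as the number of samples grows.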

The Correlation Coefficient

This tool has the most impressive sounding name, though it is really very simple in concept. We usually need to answer questions about the relation between two variables.

Say we have the ages and the grades for all the people in a university and we want to know if older people get better grades--if age and grades are correlated.

If we subtract the mean of the ages from the person's age, we get a number which is either negative, positive, or zero. If we subtract the mean of the grades from the person's grade, we get another negative, positive, or zero number. If the two numbers are both positive or both negative, the grade and the age are directly correlated for that person. If one is positive and one negative, the grade and the age are inversely correlated. Multiplying the two numbers captures this: the product is positive for a direct correlation and negative for an inverse one.

Taking the mean of all the multiplied numbers will yield a number which is positive if most are directly correlated, negative if most are inversely correlated, or zero if the differences cancel and there is no correlation.

Since the resultant mean is a product of two deviations, it is possible to scale the result by dividing by the product of the two standard deviations. This gives a number between -1 and 1. The easy way to do this is to use CORREL() in Excel.
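Here is a hedged Python sketch of the steps just described: multiply the paired deviations, take the mean of the products, then divide by the product of the two standard deviations. The function name correlation and the ages and grades are made up for illustration.

```python
import math


def correlation(xs, ys):
    """Mean product of paired deviations, scaled by both standard deviations."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least two values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Product of the two deviations for each person, then the mean of those.
    products = [(x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)]
    mean_product = sum(products) / n
    # Standard deviations using the same divide-by-n recipe.
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined when a variable never varies")
    return mean_product / (sd_x * sd_y)


# Hypothetical ages and grade averages for five students.
ages = [18, 20, 22, 25, 30]
grades = [2.0, 2.5, 3.0, 3.2, 3.8]
r = correlation(ages, grades)
print(round(r, 2))  # positive: in this made-up data, older students score higher
```

Dividing by n or by n - 1 makes no difference to the final number as long as the same divisor is used throughout, because it cancels in the ratio.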

The Confidence Interval

This has a reassuring name, since most people have no confidence in statistics.

The confidence interval is the range of numbers in which we have confidence we can find the real mean. It is the spread of values around the sample mean within which the real mean probably lies. It is often used to test the usefulness of a sample of the data where a sample is used instead of the whole population.

The confidence interval is always a good thing to offer in any study, but its main use is in determining the usefulness of a particular sample size. If the number of samples is small, the confidence interval widens and the usefulness of the study is less. This one simple fact is often ignored by the press and provides the main way statistics can lie. A big sample gives more confidence (though the study may still be asking the wrong questions).

There are various ways to measure a confidence interval, but the simplest is about all anyone needs to use. The 95% confidence interval is approximately two times the standard deviation divided by the square root of the number of values. The 99% confidence interval is about 2.6 times the standard deviation divided by the square root of the number of values. The easy way to do this is to use CONFIDENCE() in Excel (a 95% confidence interval with a standard deviation of 2.5 and 50 values is CONFIDENCE(.05,2.5,50)). The confidence interval is formed by adding this number to the mean and subtracting it from the mean.
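As a rough Python counterpart to the Excel example, the sketch below uses the normal distribution from the standard library to get the multiplier (about 1.96 for 95% and about 2.58 for 99%, which the rule of thumb rounds to 2 and 2.6). The function name confidence_margin and the sample mean of 20.0 are assumptions made up for illustration.

```python
import math
from statistics import NormalDist


def confidence_margin(alpha, standard_dev, size):
    """Half-width of the interval: multiplier * standard deviation / sqrt(n)."""
    # alpha = 0.05 means a 95% interval; alpha = 0.01 means a 99% interval.
    multiplier = NormalDist().inv_cdf(1 - alpha / 2)
    return multiplier * standard_dev / math.sqrt(size)


# Same inputs as the Excel example: 95%, standard deviation 2.5, 50 values.
margin = confidence_margin(0.05, 2.5, 50)
print(round(margin, 2))  # 0.69

# Hypothetical sample mean; the interval is the mean plus and minus the margin.
sample_mean = 20.0
low, high = sample_mean - margin, sample_mean + margin
print(round(low, 2), round(high, 2))  # 19.31 20.69
```

This normal-distribution approach is the simple version; with very small samples other methods give somewhat wider intervals.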

Using the four simple tools

These four tools give incredible power. The mean and standard deviation are the most essential; the correlation coefficient and the confidence interval add most of the rest of the needed abilities.

Mean and Standard Deviation

The mean and standard deviation are the foundation. Knowing the average collapses many numbers into one. Knowing the standard deviation gives a good idea of how good the mean is. These two activities are the heart of statistics.

The mean and standard deviation can be used to derive most of the rest of statistics, including the other two simple tools.

The press usually publishes the mean and ignores the standard deviation. This is one easy way to lie with statistics.

The press also ignores the mean and goes for the heart-wrenching individual case. The murder rate declines, yet the press maintains the illusion of an increase by sticking with stories instead of reporting the numbers.

The mean is easy to understand and use. The standard deviation will tame most abuses of it.

Correlation Coefficient

Any time two variables need to be compared, use the correlation coefficient.

To determine a test's reliability, divide the questions into two groups, score each group as its own test for every person, and get the correlation coefficient of the two sets of scores (which is a direct measurement of the reliability of the test).

To see if poverty and crime go together (the first step in asking whether poverty causes crime), use the correlation coefficient between crime rate and income values obtained by survey.

In a large group of different variables, do a correlation coefficient of each one with each other one to see which may be related. This gives an idea of where to do further research.
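A sketch of this kind of screening in Python, assuming the variables are stored as equal-length lists in a dictionary. The variable names and all the numbers are made up, and the helper correlation is my own, not a library call.

```python
import math
from itertools import combinations


def correlation(xs, ys):
    """Correlation coefficient of two equal-length lists that both vary."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    product_sum = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return product_sum / (spread_x * spread_y)


# Hypothetical measurements on the same five people.
variables = {
    "age": [18, 20, 22, 25, 30],
    "grade": [2.0, 2.5, 3.0, 3.2, 3.8],
    "commute_minutes": [30, 5, 45, 10, 25],
}

# Correlate every variable with every other variable exactly once.
pairs = {
    (a, b): correlation(variables[a], variables[b])
    for a, b in combinations(variables, 2)
}

# List the strongest relationships first as leads for further research.
for (a, b), r in sorted(pairs.items(), key=lambda item: -abs(item[1])):
    print(f"{a} vs {b}: r = {r:+.2f}")
```

With many variables some pairs will look related by chance alone, so the list is a starting point and not a conclusion.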

To see if a pill is effective in a double-blind study, get the correlation coefficient of the dosage to the patient's perceived level of health.

Confidence Interval

The confidence interval (often called the "CI") is really a fancier standard deviation which takes into account the sample size.

It makes sense that the sample size should affect how good statistics are. If I make a study with only 2 people, it can't possibly be as good as a study with 1000. The confidence interval gives me a handle on how large a sample is needed to make sensible conclusions from the data.

If my standard deviation is 10, 100 samples gives a CI of 2, 1000 samples a CI of .63, 10000 samples a CI of .2. Going from 100 samples to 1000 samples narrows the interval by 68%. To narrow it by another 68% requires getting 9000 more samples. That's why 1000 samples is often considered enough.
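These figures can be checked with a few lines of Python, using the rule of thumb of two times the standard deviation divided by the square root of the number of samples. The function name ci_95 is made up for illustration.

```python
import math


def ci_95(standard_dev, size):
    """Rule-of-thumb 95% interval half-width: 2 * sd / sqrt(n)."""
    return 2 * standard_dev / math.sqrt(size)


previous = None
for size in (100, 1000, 10000):
    width = ci_95(10, size)
    if previous is None:
        print(f"{size:>6} samples: CI {width:.2f}")
    else:
        # How much narrower the interval is than at the previous sample size.
        shrink = 100 * (1 - width / previous)
        print(f"{size:>6} samples: CI {width:.2f} ({shrink:.0f}% narrower)")
    previous = width
```

Each tenfold increase in samples buys the same 68% narrowing, so the cost of extra precision climbs quickly.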

In a scientific study, where I have competing hypotheses (each with its own prediction of the mean), I can exclude those hypotheses whose means lie outside the confidence interval.
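A minimal sketch of that exclusion step, assuming each hypothesis is reduced to a single predicted mean. The function name excluded_hypotheses, the hypothesis labels, and every number below are hypothetical.

```python
def excluded_hypotheses(predicted_means, sample_mean, margin):
    """Return the hypotheses whose predicted mean falls outside the interval."""
    low, high = sample_mean - margin, sample_mean + margin
    return [
        name
        for name, predicted in predicted_means.items()
        if not (low <= predicted <= high)
    ]


# Hypothetical study: the sample mean is 10.3 with a margin of 0.7,
# so the confidence interval runs from 9.6 to 11.0.
predictions = {"hypothesis A": 10.0, "hypothesis B": 12.5}
print(excluded_hypotheses(predictions, sample_mean=10.3, margin=0.7))
# ['hypothesis B']
```

A hypothesis that stays inside the interval is not proven; it is only not ruled out by this sample.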

It is really easy to make a study with a small sample size, and discard the studies until you get the result you want. Since the press reports only the bottom line, many vested interests use this technique to mislead with statistics. Being required to report confidence intervals would make this almost impossible.

If someone spouts a questionable statistic, ask for the CI.