April 10, 2013 --- Class 26 --- Chi-squared Distribution, Confidence Levels,
Polyfit
Activities:
Chi-squared distribution
------------------------
We briefly reviewed a method of getting numbers with a chi-squared
distribution based on Gaussian random numbers and an awk command:
The code for creating gaussian random numbers is in
~sg/src/gausrand.c.
gausrand.c uses an accept-reject algorithm to generate numbers
uniformly in the unit circle and then rescales the radius of
the points to create a gaussian distribution. Like the
Box-Muller method, it produces two numbers at a time.
It comes from Sec. 7.2 of Numerical Recipes.
testgaus.c is a simple driver programs that calls gausrand 250000
times and prints the result to standard output.
The awk script chisq.a looks like this:
{sum += $1*$1; if (NR%n == 0) {print sum; sum=0.} }
testgaus |awk -f chisq.a n=6
will produce a stream of numbers according to the chisq distribution
with N=6. Of course, you can pipe this to hist or autocorr if you
want to see a histogram or verify that the autocorrelation is small.
One can interactively produced histograms for different values of N:
foreach i ( 2 4 6 8 10 12 20)
testgaus |awk -f chisq.a n=$i|hist -n 50 -g |axis lt $i|xplot
end
Notice the label in the axis command so we can tell which graph
is which.
Analytic Approach to Chi-squared Distribution
---------------------------------------------
We showed how the chi squared distribution can be found by
integrating random variables in spherical coordinates. If we integrate
all the variables but the radial one, we find that the radial variable
is equivalent to the random variable chi-squared. In fact, we don't
even have to do the angular integral because it contributes to the
normalization and we can just normalize our probability distribution
function so that when we integrate it from 0 to infinity, we get one.
However, we derived the angular integral which is the surface area
of an n-dimensional sphere by completing the integration over the
radial variable and using the fact that we started with the
normalized integral.
Here is a Mathematica expression for the chi-squared distribution:
chisq[x_, n_] := (x^((n - 2)/2)*Exp[-(x/2)])/(Gamma[n/2]*2^(n/2))
When making histograms of chi squared or graphs in Mathematica, we find
that the peak in the function occurs where x is about n, the number of
degrees of freedom.
Chi-Squared in C
----------------
If you want to know what a chi-squared distribution looks like you
may use my chisq program in directory ~sg/src/misc.
The program chisq.c calls gausrand() to produce numbers with a
chi-sq distribution. This runs more quickly than using the awk
script discussed above. Of course, writing the C program
takes more time than typing the awk script. So we have yet another
example of trying to compare development time and run time.
To compile the code you need to give the command
cc -o chisq chisq-.c gausrand.c newfsr.c
If you just type chisq, you will get this message:
Usage: chisq dof num
where dof is the number of degrees of freedom
and num is the number of random variables generated
Now you know how to run the code. For example, try:
chisq 10 10000 |~sg/bin/hist -n 40 -g |axis |xplot
will produce a nice graph of the distribution for 10 degrees
of freedom.
It was easy to interactively produce a series of histograms with
the new command.
Try comparing a historgram from a numerical experiment with a
Mathematica plot of the chisq function above.
Confidence Levels
-----------------
We considered the fact that the quantity \chi^2 can be used to
determine the goodness of fit. Since \chi^2 should be the sum of
the squares of N normalized Gaussian variables, it has a known
distribution. If we find that our fit gives a small value of
\chi^2, that is good. If we get a large value, maybe our model is
not the correct description of the data. If the model has p
parameters that we adjust, the fit is said to have N-p degrees of
freedom. A program confidence2, whose source code is in
~sg/src/misc/confidence2_2.c can be used to determine the confidence
level of a fit. The confidence level depends on \chi^2 and the
number of degrees of freedom. The confidence level is the
probability that \chi^2 would exceed the value for which we are
computing the confidence level. As an example:
confidence2
6.3 5
6.300000e+00 5.000000e+00 2.781123e-01
This represents the screen display for considering \chi^2=6.3 for
5 degrees of freedom. The confidence level is 0.278 or nearly 28%.
Thus, for a fit with this confidence level, we would realize that
even for a correct model the computed value of \chi^2 would be
larger than this 28% of the time. Usually, statistician like to
rule things out at the 90% or 95% confidence level. So, when your
confidence level gets below 5-10%, you begin to wonder if you
really have the right model for describing your data. (It is also
possible that you have underestimated the errors.)
BTW, confidence2 takes its input from the standard input, so hit
Control-D or Control-C to terminate its running.
Confidence Level From Mathmatica
--------------------------------
Since we have already considered how to get Mathmematica to calculate
the normalized chisq distribution, we can just use numerical
integration to compute the confidence level:
chisq[x_, n_] := (x^((n - 2)/2)*Exp[-(x/2)])/(Gamma[n/2]*2^(n/2))
confidence[chisquared_, n_] := NIntegrate[ chisq[y,n], {y,chisquared,Infinity}]
Polyfit
-------
Polyfit is a routine that fits data with an nth order polynomial.
You are welcome to use it. You will find the code and a makefile
in ~sg/src/misc. Before class, I made up a code to produce data
close to a quadratic function. It is called almostquad.c and is
set to produce a curve
2.0 + x + 0.1 x*x + noise
The noise is gaussian with a standard deviation sigma = 0.8.
You may run almostquad (it is in ~sg/bin) and put the output
in a file. You may then use polyfit to fit the data.
You can learn quite a bit by doing this once, but I wanted to explore
the statistics of fitting many data sets that only differ by the
random noise. We can ask questions such as:
1) what is the distribution of chi-squared produced by the fit?
2) what is the distribution of confidence level?
3) what is the distribution of each of the fitted parameters and
how does it compare with the error estimated by doing the fit?
In our next class we will explore these issues using
~sg/src/misc/almostquad.csh which is set up to do this
experiment 10000 times.