November 19, 2013 --- Class 25 --- Chi-square Distribution
Activities:
Chi-square Distribution
-----------------------
In the last class, the case of fitting data using
a linear model was considered in great detail. We
obtained specific formulae for the slope and intercept parameters of
the fit. However, even with a formula for the slope and intercept,
there are two important questions:
1) How large is the error in the fitted parameters?
2) How good is the fit?
Using a standard propagation of errors approach, we found the error
on the slope. At the beginning of the class I reminded everyone
of the error in the slope and then gave the formulae for the
intercept, and the covariance of slope and intercept.
We next considered the fact that the quantity \chi^2 can be used to
determine the goodness of fit. Since \chi^2 should be the sum of
the squares of N normalized Gaussian variables, it has a known
distribution. If we find that our fit gives a small value of
\chi^2, that is good. If we get a large value, maybe our model is
not the correct description of the data. If the model has p
parameters that we adjust, the fit is said to have N-p degrees of
freedom. Soon we will look at a code to compute
the probability that \chi^2 is greater than the observed value.
This is called the confidence level of the fit. If your
confidence level gets below 5-10%, you begin to wonder if you
really have the right model for describing your data.
To get a feeling for what the chisq distribution looks like I asked
class members to consult with a neighbor and think about the
tools we have and how they can be used to explore the chi-square
distribution. One approach, championed by Xiao Fu, is to get
Mathematica to produce a list of Gaussian random numbers, square them,
take the total and repeat many times to make a histogram. We can
easily make a list:
v= RandomReal[NormalDistribution[],{10}]
makes a list of ten numbers, and
Total[v*v]
gives us the sum of the squares. Defining two functions makes our
life easy:
list[n_Integer] := RandomReal[NormalDistribution[],{n}]
makedist[dof_Integer] := Table[v=list[dof]; Total[v*v], {100000}]
Now, you may just use Histogram[makedist[10]] to make a histogram
for the chi-square distribution with ten degrees of freedom.
In an alternative approach, we took
a code that prints out gaussian random numbers and created an awk
script to sum squares of N such numbers. The code for creating
gaussian random numbers is in ~sg/src/misc/gausrand.c. gausrand.c uses
an accept-reject algorithm to generate numbers uniformly in the
unit circle and then rescales the radius of the points to create
a gaussian distribution. Like the Box-Muller method, it produces
two numbers at a time. It comes from Sec. 7.2 of Numerical Recipes.
testgaus.c is a simple driver program that calls gausrand 250000
times and prints the result to standard output.
The awk script chisq.a looks like this:
{sum += $1*$1; if (NR%n == 0) {print sum; sum=0.} }
testgaus |awk -f chisq.a n=6
will produce a stream of numbers according to the chisq distribution
with N=6. Of course, you can pipe this to hist or autocorr if you
want to see a histogram or verify that the autocorrelation is small.
Interactively, I wrote a bash for loop to look at values of n from 2 to
30 and display a graph in each case.
#!/bin/bash
for i in {2..30}
do
~/bin/testgaus |awk -f chisq.a n=$i|hist -n 50 -g |axis lt $i |plot -Tx
done
Note how the shell variable is used to make a title on the graph.
Analytic Approach to Chi-square Distribution
---------------------------------------------
We showed how the chi square distribution can be found by
integrating random variables in spherical coordinates. If we integrate
all the variables but the radial one, we find that the square of the
radial variable is the chi-squared random variable. In fact, we don't
even have to do the angular integral because it contributes to the
normalization and we can just normalize our probability distribution
function so that when we integrate it from 0 to infinity, we get one.
However, we also derived the angular integral, which is the surface area
of an n-dimensional sphere, by completing the integration over the
radial variable and using the fact that we started with the
normalized integral.