November 20, 2014 --- Class 26 --- Fitting Data, Chi-square Distribution
Activities:
Fitting Data
------------
We looked at what one should do to fit data to a model function.
If the data has a Gaussian distribution about its true value, then
we maximize the probability for the set of observations by
adjusting the parameters of the model to minimize \chi^2.
\chi^2= \Sum_{i=0}^N} (y_i- y_model(x_i)^2/ \sigma_i^2
Linear Model Function
---------------------
The case of a linear model was considered in great detail. We
obtained specific formulae for the slope and intercept parameters of
the fit. However, even with a formula for the slope and intercept,
there are two important questions:
1) How large is the error in the fitted parameters?
2) How good is the fit?
Using a standard propagation of errors approach, we found the error
on the slope. and intercept, and the covariance of those quantities.
Only one was worked out in detail.
Chi-squared Distribution
------------------------
In consideration of our second question we note that \chi^2 can be used to
determine the goodness of fit. Since \chi^2 should be the sum of
the squares of N normalized Gaussian variables, it has a known
distribution. If we find that our fit gives a small value of
\chi^2, that is good. If we get a large value, maybe our model is
not the correct description of the data. If the model has p
parameters that we adjust, the fit is said to have N-p degrees of
freedom. Soon we will look at a code to compute
the probabily that chisq is greater than the observed value.
This is called the confidence level of the fit. If your
confidence level gets below 5-10%, you begin to wonder if you
really have the right model for describing your data.
We created variables with a chi-square distribution by starting with
a code that prints out gaussian random numbers and created an awk
script to sum squares of N such numbers. The code for creating
gaussian random numbers is in ~sg/src/misc/gausrand.c. gausrand.c uses
an accept-reject algorithm to generate numbers uniformly in the
unit circle and then rescales the radius of the points to create
a gaussian distribution. Like the Box-Muller method, it produces
two numbers at a time. It comes from Sec. 7.2 of Numerical Recipes.
testgaus.c is a simple driver programs that calls gausrand 2500000
times and prints the result to standard output.
The awk script chisq.a looks like this:
{sum += $1*$1; if (NR%n == 0) {print sum; sum=0.} }
testgaus |awk -f chisq.a n=6
will produce a stream of numbers according to the chisq distribution
with N=6. Of course, you can pipe this to hist or autocorr if you
want to see a histogram or verify that the autocorrelation is small.
Here is a bash script to look at values of n from 2 to
30 and display a graph for each case. Can you convert to a while loop
in csh?
#!/bin/bash
for i in {2..30}
do
~/bin/testgaus |awk -f chisq.a n=$i|hist -n 50 -g |axis lt $i |plot -Tx
done
Note how the shell variable is used to make a title on the graph.
Here is an alternative approach based on Mathematica that we did
not try in class.
v= RandomReal[NormalDistribution[],{10}]
make a list of ten numbers
Total[v*v]
gives us the sum of the squares. Defining two functions makes our
life easy:
list[n_Integer] := RandomReal[NormalDistribution[],{n}]
makedist[dof_Integer] := Table[v=list[dof]; Total[v*v], {100000}]
Now, you may just use Histogram[makedist[10]] to make a histogram
for the chi-square distribution with ten degrees of freedom.
Analytic Approach to Chi-square Distribution
---------------------------------------------
We started to show how the chi square distribution can be found by
integrating random variables in spherical coordinates. If we integrate
all the variables but the radial one, we find that the radial variable
is equivalent to the random variable chi-squared. In fact, we don't
even have to do the angular integral because it contributes to the
normalization and we can just normalize our probability distribution
function so that when we integrate it from 0 to infinity, we get one.
In our next class, we will continue the derivation and
derive the angular integral which is the surface area
of an n-dimensional sphere by completing the integration over the
radial variable and using the fact that we started with a
normalized integral.