April 12, 2012 --- Class 26 --- SPRNG, Fitting Data, Chi-squared Distribution
Activities:
SPRNG
In our last class we briefly looked at SPRNG a set of
Scalable Parallel Pseudo Random Number Generators:
http://sprng.cs.fsu.edu/
These routines can be used in parallel programs that require a different
(independent) stream of random numbers on each node. I tried to compile
the routines from source code. It turns out that there are some
problems with the fortran compiler on our cluster. The issues have
to do with 32-bit vs 64-bit executables. There may be other issues.
Be that as it may, the libsprng.a library is compiles from C++ at
it compiles properly, even if the fortran codes do not. I reran
configure and make. When make failed, I went into the EXAMPLES
subdirectory and typed make again. Although the fortran examples
did not compile, the C++ examples did and I ran some simple codes
that produced random doubles and integers. Here is a listing of all
the example programs:
-rwxr-xr-x 1 sg staff 315704 Apr 12 09:39 convert.tmp
-rwxr-xr-x 1 sg staff 316504 Apr 12 09:39 pi-simple.tmp
-rwxr-xr-x 1 sg staff 311120 Apr 12 09:39 seed-simple.tmp
-rwxr-xr-x 1 sg staff 315696 Apr 12 09:39 seed.tmp
-rwxr-xr-x 1 sg staff 311048 Apr 12 09:39 simple-simple.tmp
-rwxr-xr-x 1 sg staff 210192 Apr 12 09:39 spawn.tmp
-rwxr-xr-x 1 sg staff 311504 Apr 12 09:39 sprng-simple.tmp
-rwxr-xr-x 1 sg staff 210192 Apr 12 09:39 sprng.tmp
Fitting Data
We looked at what one should do to fit data to a model function.
If the data has a Gaussian distribution about its true value, then
we maximize the probability for the set of observations by
adjusting the parameters of the model to minimize \chi^2.
\chi^2= \Sum_{i=0}^N} (y_i- y_model(x_i)^2/ \sigma_i^2
Linear Model Function
The case of a linear model was considered in great detail. We
obtained specific formulae for the slope and intercept parameters of
the fit. However, even with a formula for the slope and intercept,
there are two important questions:
1) How large is the error in the fitted parameters?
2) How good is the fit?
Using a standard propagation of errors approach, we found the error
on the slope and intercept, and the covariance of those quantities.
We also considered the fact that the quantity \chi^2 can be used to
determine the goodness of fit. Since \chi^2 should be the sum of
the squares of N normalized Gaussian variables, it has a known
distribution. If we find that our fit gives a small value of
\chi^2, that is good. If we get a large value, maybe our model is
not the correct description of the data. If the model has p
parameters that we adjust, the fit is said to have N-p degrees of
freedom. Soon we will look at a code to compute
the probabily that chisq is greater than the observed value.
This is called the confidence level of the fit. If your
confidence level gets below 5-10%, you begin to wonder if you
really have the right model for describing your data.
To get a feeling for what the chisq distribution looks like we took
a code that prints out gaussian random numbers and created an awk
script to sum squares of N such numbers. The code for creating
gaussian random numbers is in ~sg/src/misc/gausrand.c. gausrand.c uses
an accept-reject algorithm to generate numbers uniformly in the
unit circle and then rescales the radius of the points to create
a gaussian distribution. Like the Box-Muller method, it produces
two numbers at a time. It comes from Sec. 7.2 of Numerical Recipes.
testgaus.c is a simple driver programs that calls gausrand 250000
times and prints the result to standard output.
The awk script chisq.a looks like this:
{sum += $1*$1; if (NR%n == 0) {print sum; sum=0.} }
testgaus |awk -f chisq.a n=6
will produce a stream of numbers according to the chisq distribution
with N=6. Of course, you can pipe this to hist or autocorr if you
want to see a histogram or verify that the autocorrelation is small.
Members of the class were assigned various values of the number of
degrees of freedom and created a histogram of the chi-sq distribution.