October 9, 2014 --- Class 13 --- Biased Estimators (continued),
Bias Reduction and the Jackknife Method, Hist, Blockerr
Activities:
Biased Esimators (continued)
We began with a short review of the previous class.
Several Mathematica notebooks are available for your review:
~sg/Documents/feb16_v1.nb , ~sg/Documents/feb16_v2.nb , and
~sg/Documents/October_10_2006.nb .
Also see the file ~sg/mathematica_notes_sg_extended.pdf .
We plotted the values obtained for different samples sizes vs
1/(sample size) and found that the curve was linearly approaching
2 as 1/(sample size) approaches zero.
Bias reduction
Recognizing that a plot of the result vs. 1/n shows the error is
linear in 1/n, where n is the sample size:
S_n = A +e/n
where A is the limit of n -> infinity and e determines the size of
the error. Considering samples sizes n and n-1, it is easy to
solve for A.
A = n S_n - (n-1) S_{n-1}
The jackknife method allows us to resample our n values to also
get samples of size (n-1). If we have a sample of size n, we use
it to compute the statistic S_n. Instead of getting a new sample
with n-1 new variables, we throw one of the values out of our
sample of size n to create a sample of size n-1 and use that to
compute S_{n-1}. However, we can throw any of the values out, so
we might as well take the average over throwing out each of the n
values. Let
S'^j_{n-1}
represent the statistic on the sample of size n-1 from which the jth
element has been eliminated. Our estimate for S_{n-1} is
(1/n) \sum_{j=1}^n S'^j_{n-1}
Biased Estimators and the Jackknife method
We have a choice of which tool too use for applying the jackknife
method. The class chose to implement the method using
Mathematica. Later will will turn to a compliled languague.
Jackknife Method Via Mathematica
Mathematica can be used to generate samples of random numbers and
to apply the jackknife technique. This is a good opportunity to
introduce a new mathematica construct the "module", which is like
a subroutine. In ~sg/jackknife/jackn.math is the code to implement
this method.
First we need a way to create our set of random variables. This is
done via the newvars command:
newvars[num_] := Do[z[i] = Random[], {i, num}]
Once this is defined, you may create as many new random numbers
in the vector z by calling newvars with the number you desire
as the argument.
A Module in mathematica takes two arguments. The first is a list
of variables local to the module, and the second is an expression,
i.e., a sequence of statements.
Using a module statement we define jackknife[x_, num_Integer],
where x is vector of random numbers and num is the sample size.
Here is the code:
jackknife[x_, num_Integer] :=
Module[{mysum, mysumpr, stat, statpr, estimate, fctr, i},
fctr = N[(num - 1)/num]; mysum = Sum[x[i], {i, 1, num}];
stat = num/mysum; estimate = num*stat;
Do[mysumpr = mysum - x[i]; statpr = (num - 1)/mysumpr;
estimate -= fctr*statpr, {i, 1, num}]; {stat, estimate}]
Successive calls to newvars and jackknife are equivalent to what
was done in the C code in the last class. I made a table from many
calls of newvars and jackknife, from which a scatterplot was
produced using ListPlot.
You may consult the Mathematica notebook feb27_jackknife.nb to
follow some of the calculations done in class.
Jackknife Method Via C Programming
A program ~sg/jackknife/jackknife.c implements the jackknife method
of bias reduction. The jackknife program can take up to two
arguments. The first is the sample size, the second, optional
argument, is the number of samples to create. On each line of
output is the sample estimate of 1/(sum of x_i), and the jackknife
value (which is corrected for bias). We examined the code an how
it implements the formulae displayed on the board.
We want to test the idea of bias reduction by doing a numerical
experiment. In the last class, we discussed a program that uses
pseudorandom numbers to simulate the process of measuring the
velocities unformly distributed between zero and one.
For sample size five, I ran the program to repeat the experiment a
few times and we saw that the number varied, I then ran
jackknife 5 100
to produce 100 instances of the experiment with sample size 5. The
class spent some time discussing how we would look at the data.
We know that for infinite sample size, the correct answer is 2, but
that for finite sample size the answer is not even 2 on average. We
also want to know if the bias reduction really works.
[sw246-00:~/jackknife] sg% jackknife 5 10
3.000586 2.913277
2.216128 2.177672
1.885894 1.680463
1.866940 1.760818
1.968507 1.794492
1.777420 1.644192
1.891794 1.675787
1.708738 1.600792
1.493175 1.391724
2.159595 1.920740
Here are ways to analyze the data:
1) Create a scatter plot and see if the 2nd value is closer to 2 on
average.
2) Produce a histogram of the first column and histogram of the 2nd
column. Compare them.
3) Compute the average value of the first column and the average
value of the second column. Compare them.
A scatter plot is very easy with axis, you just need the argument -m 0 .
It was also easy to add a diagonal line and to see that most points
fell below the diagonal, i.e., the first number is usually larger than
the second. Since we know from the integral approach that on average
sample size 5 gives 2.1696, we know the unbiased value is too big.
The fact that the bias reduced values are smaller is encouraging.
I love to make histograms and I have a great program to do that which
you are welcome to copy and use. The program is called hist. Here are
some comments that explain how to use it.
Hist for Making Histograms
hist is a program for making histograms and may be found in
~sg//src/misc
for the source (hist.c) and the executable is in ~sg/bin. The options
for hist are described in the comments at the top of hist.c:
/* Make histogram of a list of data *
* option -n specifies number of bins *
* option -s specifies size of bin *
* option -x specifies lower and upper limits of histogram *
* option -g specifies output in form suitable for "graph" *
* or "axis" *
* if -g 1 we get a full bar for each bin *
* if -g 2 the bars don't go down to the origin *
* option -m specifies the line type for use with "axis" *
hist takes it input from the stardard input. It will also take
10 bins by default. So, if you have file of number called file,
hist samples_$i
echo -n Naive $i " " >>! results
awk '{print $1}' samples_$i | blockerr 1 |awk '{print $8, $10}' >>results
echo -n Jackknife $i " " >>! results
awk '{print $2}' samples_$i | blockerr 1 |awk '{print $8, $10}' >>results
end
This is another example of how we can figure out a useful procedure
interactively and then we find that being able to automate it for
different cases is a great advantage.
The results file can easily be displayed with the command:
awk '{print $2,$3,$4}' results |axis e |xplot
The upper set of values is from the naive estimate, and the lower
is the jackknife corrected values. One immediately sees that the
jackknife values much more quickly approach 2, the asymptotic
value.
The file ~sg/jackknife/jackknife.ax on Swain cluster, was
prepared before class, and includes both the above results
and the analytic results derived in the last
class. The diamonds are from doing the integral numerically with
Mathematica. One can see excellent agreement between the
statistical approach here and the analytic results.
The following supplementary material was not covered in class.
Other approaches to removing the sample bias include using the
basic idea behind the jackknife bias reduction of the Mathematica
generated results, and applying a "second order" jacknife technique
on the jackknifed results to remove the 1/n^2 errors.
(Assuming there are error terms e/n + e'/n^2, derive a formula
involving S_n, S_(n-1) and S_(n-2) that removes both error terms to
get a second order bias reduced result.)
Applying the jackknife to the Mathematica results works
quite nicely as you can see by the following command:
cat jackknife.ax j.math.ax |axis |xplot
The lower crosses are the results from Mathematica. The commands
for doing the analysis can be found in ~sg/oldhome/sg/jackknife/bias2.math