October 11, 2007 --- Class 13 --- Review of Biased Estimators, Hist, Application of the Jackknife Method Activities: Biased Esimators Using Mathematica We quickly reviewed the example of biased estimators in the Mathematica handout. We made use of the Mathematica notebook ~sg/October_10_2006.nb on the Apple cluster in Swain. We found that for small samples sizes, the integrals, which represent the average value for 1/(mean value) we get numbers larger than 2, even though for an infinite sample we expect 2. We plotted the values obtained for different samples sizes vs 1/(sample size) and found that the curve was linearly approaching 2 as 1/(sample size) approaches zero. Bias reduction Recognizing that a plot of the result vs. 1/n shows the error is linear in 1/n, where n is the sample size: S_n = A +e/n where A is the limit of n -> infinity and e determines the size of the error. Considering samples sizes n and n-1, it is easy to solve for A. A = n S_n - (n-1) S_{n-1} In the next class, we will apply this idea to bias reduction by looking at different sample sizes. The jackknife method allows us to resample our n values to also get samples of size (n-1). Biased Estimators and the Jackknife method A program ~sg/jackknife/jackknife.c implements the jackknife method of bias reduction. The jackknife program can take up to two arguments. The first is the sample size, the second, optional argument, is the number of samples to create. On each line of output is the sample estimate of 1/(sum of x_i), and the jackknife value (which is corrected for bias). We examined the code an how it implements the formulae displayed on the board. To analyze this method, we pick a bunch of samples sizes to study and generate a large number of samples for each size. To see the difference between the naive estimate and the bias corrected values, two programs hist and blockerr are useful. (Another simple technique is to use axis to create a scatterplot using the m 0 option so the points are not connected. A diagonal line can easily be added to the plot. We see that the points in the scatterplot are mostly on one side of the diagonal. The jackknife values tend to be smaller than the naive value. This is most dramatic for small sample sizes.) hist is a program for making histograms and may be found in ~sg/src for the source (hist.c) and the executable is in bin. The options for hist are described in the comments at the top of hist.c: /* Make histogram of a list of data * * option -n specifies number of bins * * option -s specifies size of bin * * option -x specifies lower and upper limits of histogram * * option -g specifies output in form suitable for "graph" * * or "axis" * * if -g 1 we get a full bar for each bin * * if -g 2 the bars don't go down to the origin * * option -m specifies the line type for use with "axis" * hist takes it input from the stardard input. It will also take 10 bins by default. So, if you have file of number called file, hist samples_$i echo -n Naive $i " " >>! results awk '{print $1}' samples_$i | blockerr 1 |awk '{print $8, $10}' >>results echo -n Jackknife $i " " >>! results awk '{print $2}' samples_$i | blockerr 1 |awk '{print $8, $10}' >>results end The results file can easily be displayed with the command: awk '{print $2,$3,$4}' results |axis e |xplot The upper set of values is from the naive estimate, and the lower is the jackknife corrected values. One immediately sees that the jackknife values much more quickly approach 2, the asymptotic value. The file ~sg/jackknife/jackknife.ax on Swain cluster, was prepared before class, and includes both the above results and the analytic results derived in the last class. The crosses are from doing the integral numerically with Mathematica. One can see excellent agreement between the statistical approach here and the analytic results. The following supplementary material was not covered in class. Other approaches to removing the sample bias include using the basic idea behind the jackknife bias reduction of the Mathematica generated results, and applying a "second order" jacknife technique on the jackknifed results to remove the 1/n^2 errors. (Assuming there are error terms e/n + e'/n^2, derive a formula involving S_n, S_(n-1) and S_(n-2) that removes both error terms to get a second order bias reduced result.) Applying the jackknife to the Mathematica results works quite nicely as you can see by the following command: cat jackknife.ax j.math.ax |axis |xplot The lower crosses are the results from Mathematica. The commands for doing the analysis can be found in ~sg/jackknife/bias2.math