... but also always make sure to interpret results correctly! This post presents a quick intro on how
        to perform statistical hypothesis testing and power analysis with the Accord.NET Framework in C#.
            
        What does hypothesis testing mean in statistics (and what should it mean everywhere, for that matter)?
        You may recall from Karl Popper's theory of falsifiability that good theories can rarely be
        proven outright, but you may gain considerable confidence in them by constantly challenging them and
        failing to refute them.
    
    
        This comes from the fact that it is often easier to falsify something than to prove it. Consider
        for instance the white-swan/black-swan example: let's say a theory states that all swans are white.
        This is a very strong claim; it does not apply to one or a few particular observations of a swan,
        but to all of them. It would be rather difficult to verify that all of the swans on Earth
        are indeed white. It is thus almost impossible to prove this theory directly.
    
        However, the catch is that it takes only a single counterexample to refute it. If
        we find a single swan that is black, the entire theory should be rejected, so alternate theories can
        be raised. It should be fairly easy to attempt to prove a theory wrong. If a theory continuously survives
        those attempts to be proven wrong, it becomes stronger. This does not necessarily mean it is correct,
        only that it is very unlikely to be wrong.
    
        This is pretty much how the scientific method works; it also provides a solution to the demarcation problem
        originally proposed by Kant: the problem of separating the sciences from the pseudo-sciences (e.g. astronomy
        from astrology). A "good" theory should be easy to attack so we can try to refute it; and by constantly
        challenging it and failing to prove it wrong, we gain further confidence that this theory may,
        indeed, be right. In short:
    
        Often the most interesting theories can't be proven right; they can only be proven wrong. By continuously
            refuting alternatives, a theory becomes stronger (but most likely never reaches the 'truth').
    
    
        Answering the question in the first phrase of this section, hypothesis testing means verifying if a
        theory holds even when confronted with alternative theories. In statistical hypothesis testing, this
        often means checking if a hypothesis holds even when confronted with the fact that it may have just
            happened to be true by pure chance or plain luck.
        
        Fisher (1925) also noted that we can't
        always prove a theory but we can attempt to refute it. Therefore, statistical hypothesis testing includes
        stating a hypothesis, which is the hypothesis we are trying to invalidate, and checking if we can confidently
        reject it by confronting it with data from a test or experiment. This hypothesis to be challenged
        is often called the null hypothesis (commonly denoted H0). It receives this
        name as it is usually the hypothesis of no change: there is no difference, nothing changed after the
        experiment, there is no effect.
    
        The hypotheses verified by statistical hypothesis tests are often theories about whether or not a random
        sample from a population comes from a given probability distribution. This seems weird, but several
        problems can be cast in this way. Suppose, for example, we would like to determine if students from
        a classroom have significantly different grades than students from another room. Any difference could
        possibly be attributed to chance, as some students may just perform better on an exam because of luck.
    
        An exam was given to both classrooms. The results for each student are listed below:
           
            
                
| Classroom A | Classroom B |
|---|---|
| 8.12, 8.34, 7.54, 8.98, 8.24, 7.15, 6.60, 7.84, 8.68, 9.44, 8.83, 8.21, 8.83, 10.0, 7.94, 9.58, 9.44, 8.36, 8.48, 8.47, 8.02, 8.20, 10.0, 8.66, 8.48, 9.17, 6.54, 7.50 | 7.50, 6.70, 8.55, 7.84, 9.23, 6.10, 8.45, 8.27, 7.01, 7.18, 9.05, 8.18, 7.70, 7.93, 8.20, 8.19, 7.65, 9.25, 8.71, 8.34, 7.47, 7.47, 8.24, 7.10, 7.87, 10.0, 8.26, 6.82, 7.53 |
| Students: 28, Mean: 8.416 | Students: 29, Mean: 7.958 |
                
            
        
       
        We have two hypotheses:
    
        - Results for classroom A are not significantly different from the results for classroom B. Any difference
 in means could be explained by chance alone (this is the null hypothesis).
        - Results are indeed different. The apparent differences are very unlikely to have occurred by chance.
    
        Since we have fewer than 30 observations in each sample, we will be using a 
            Two-Sample T-Test to test the hypothesis that the population means of the two samples
        are not equal. Besides, we will not be assuming equal variances. So let's create our test object:
    
    
        using Accord.Statistics;
        using Accord.Statistics.Testing;
        using Accord.Statistics.Testing.Power;

        // A and B are double[] arrays holding the grades listed above
        TwoSampleTTest test = new TwoSampleTTest(A, B,
              hypothesis: TwoSampleHypothesis.ValuesAreNotEqual);
    
        And now we can query it:
    
        Console.WriteLine("Significant: "
        + test.Significant); // true; the test is significant
    
        Which reveals the test is indeed significant. And now we have at least two problems to address...
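        As an aside, it may help to see what goes into a statistic of this kind. The block below is only a minimal sketch in plain C#, assuming the textbook unequal-variance (Welch) form of the two-sample t statistic; it is not a copy of how the framework computes it, and the class name WelchSketch is made up for illustration. The grades are copied from the table above.

```csharp
using System;
using System.Linq;

public static class WelchSketch
{
    // Grades copied from the classroom table.
    public static readonly double[] A =
    {
        8.12, 8.34, 7.54, 8.98, 8.24, 7.15, 6.60, 7.84, 8.68, 9.44,
        8.83, 8.21, 8.83, 10.0, 7.94, 9.58, 9.44, 8.36, 8.48, 8.47,
        8.02, 8.20, 10.0, 8.66, 8.48, 9.17, 6.54, 7.50
    };

    public static readonly double[] B =
    {
        7.50, 6.70, 8.55, 7.84, 9.23, 6.10, 8.45, 8.27, 7.01, 7.18,
        9.05, 8.18, 7.70, 7.93, 8.20, 8.19, 7.65, 9.25, 8.71, 8.34,
        7.47, 7.47, 8.24, 7.10, 7.87, 10.0, 8.26, 6.82, 7.53
    };

    // Unbiased sample variance (divides by n - 1).
    public static double Variance(double[] x)
    {
        double mean = x.Average();
        return x.Sum(v => (v - mean) * (v - mean)) / (x.Length - 1);
    }

    // Welch's t statistic: difference in means divided by its standard error.
    public static double Statistic(double[] x, double[] y)
    {
        double se = Math.Sqrt(Variance(x) / x.Length + Variance(y) / y.Length);
        return (x.Average() - y.Average()) / se;
    }

    // Welch-Satterthwaite approximation to the degrees of freedom.
    public static double DegreesOfFreedom(double[] x, double[] y)
    {
        double a = Variance(x) / x.Length;
        double b = Variance(y) / y.Length;
        return (a + b) * (a + b)
            / (a * a / (x.Length - 1) + b * b / (y.Length - 1));
    }
}
```

        Calling WelchSketch.A.Average() and WelchSketch.B.Average() reproduces the means in the table (about 8.416 and 7.958). The statistic from WelchSketch.Statistic(A, B) would then be compared against a t distribution with the approximate degrees of freedom to obtain a p-value.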
    
        Problem 1: Statistical significance does not imply practical significance
    
        So the test was significant. But would this mean the difference itself is significant?
        Would this mean there is any serious problem with the school's teaching method?
    
        No, but it doesn't mean the contrary either. It is impossible to tell just by looking at the p-value.
    
        The test only said there was a difference, but it cannot tell the importance of this difference.
        Although the two classes performed differently enough to trigger statistical significance,
        we don't know if this difference has any practical significance. A statistical test being significant
        is not a proof; it is just one piece of evidence to be balanced together with other pieces of information
        in order to draw a conclusion.
    
        Perhaps one of the best examples illustrating this problem 
            is given by Martha K. Smith:
    
        Suppose a large clinical trial is carried out to compare a new medical treatment with a standard one.
        The statistical analysis shows a statistically significant difference in lifespan when using the new
        treatment compared to the old one. But the increase in lifespan is at most three days, with average
        increase less than 24 hours, and with poor quality of life during the period of extended life. Most
        people would not consider the improvement practically significant.
    
    
        In our classroom example, the difference in means is about 0.46 points. If principals believe a difference
        of less than 0.50 on a scale from 0.00 to 10.00 is not that critical, there may be no need to force
        students from the room with lower grades to start taking extra lessons after school. In other words,
        statistical hypothesis testing does not lead to automatic decision making. A statistically significant
        test is just another piece of evidence which should be balanced with other clues in order to make a decision or
            draw a conclusion.
    
        Problem 2: Powerless tests can be misleading
    
        The p-value reported by the significance test is the probability of observing data at least as extreme
        as what we found, given that the null hypothesis is correct. Often, however, the null hypothesis is not
        correct. We must therefore also know the probability that the test will reject the null hypothesis when
        the null hypothesis is false. To do so, we must compute the power of our test; better yet, we should
        have used this information before conducting our experiment to decide how many samples we would need
        for a more informative test. The power is the probability of detecting
        a difference if this difference indeed exists (Smith,
            2011). So let's see:
    
        // Check the actual power of the test...
        Console.WriteLine("Power: "
        + test.Analysis.Power); // gives about 0.50
    
        The test we performed had astonishingly small power; so if the null hypothesis is false (and there is
        actually a difference between the classrooms) we have only about 50% chance of correctly rejecting it.
        Therefore, this would also lead to a 50% chance of producing a false negative - incorrectly saying there
        is no difference when actually there is. The table below summarizes the different errors we can
        make when rejecting or failing to reject the null hypothesis.
    
        
            
|  | Null hypothesis is true | Null hypothesis is false |
|---|---|---|
| Fail to reject the null hypothesis | Correct (true negative) | Type II error (beta), false negative |
| Reject the null hypothesis | Type I error (alpha), false positive | Correct outcome (true positive) |
    
    
        Results from tests with little statistical power are 
            often inconsistent in the literature. Suppose, for example, that the first student
        from classroom B had scored 7.52 instead of 7.50. Due to the low power of the test, this little
        change would already be sufficient to render the test nonsignificant, and we would no longer be able to reject
        the null hypothesis that the two population means do not differ. Due to the low power
        of the test, we can't distinguish between a correct true negative and a type II error. This is why
        powerless tests can be misleading and should never be relied upon for decision making (Smith,
            2011b).
    
        The power of a test increases with the sample size. To obtain a power of at least 80%, let's see
        how many samples should have been collected:
    
        // Create a post-hoc analysis to determine sample size
        var analysis = new TwoSampleTTestPowerAnalysis(test);
        analysis.Power = 0.80;
        analysis.ComputeSamples();

        Console.WriteLine("Samples in each group: "
            + Math.Ceiling(analysis.Samples1)); // gives 57
    
    
        So, we would actually need 57 students in each classroom to confidently affirm whether there
        was a difference or not. Pretty disappointing, since in the real world we wouldn't be able to enroll
        more students and wait years until we could perform another exam just to adjust the sample size. In
        those situations, the power of the test can be increased by raising the significance threshold (Thomas, Juanes, 1996),
        although at the cost of a higher rate of false positives (Type I errors).
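        To get a feeling for how sample size, effect size and power trade off, here is a rough sketch based on the usual normal approximation, n = 2 * ((z_alpha + z_power) / d)^2 per group. This is an assumption made for illustration and not the calculation performed by the framework; the class and method names are hypothetical, and the default quantiles are the standard normal values for a two-sided alpha of 0.05 and a power of 0.80.

```csharp
using System;

public static class SampleSizeSketch
{
    // Normal-approximation sample size per group for a two-sided,
    // two-sample comparison of means.
    //   d      : standardized effect size (difference in means / common std. dev.)
    //   zAlpha : standard normal quantile for 1 - alpha/2 (1.959964 for alpha = 0.05)
    //   zPower : standard normal quantile for the target power (0.841621 for 0.80)
    public static int SamplesPerGroup(double d,
        double zAlpha = 1.959964, double zPower = 0.841621)
    {
        if (d <= 0)
            throw new ArgumentOutOfRangeException(nameof(d));

        double n = 2.0 * Math.Pow((zAlpha + zPower) / d, 2);
        return (int)Math.Ceiling(n);
    }
}
```

        For a hypothetical medium effect size of d = 0.5, SampleSizeSketch.SamplesPerGroup(0.5) returns 63. An exact calculation based on the t distribution gives slightly larger numbers, so this should be read only as a back-of-the-envelope check. Note how quickly the requirement grows as the effect gets smaller: halving d roughly quadruples the number of samples needed.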
        
        So, can the null hypothesis ever be accepted? The short answer is 'only if you have enough power'. Otherwise, definitely not.
    
        If you have reason to believe the test you performed had enough power to detect a difference within
        the given Type-II error rate, and it didn't, then accepting the null would most likely be acceptable.
        The acceptance should also be accompanied by an analysis of confidence intervals or effect sizes. Consider,
        for example, that some actual scientific discoveries were 
            indeed made by accepting the null hypothesis rather than by contradicting it; one notable example
        being the discovery of the X-Ray (Yu,
            2012).
        
        Much of the criticism associated with statistical hypothesis testing is often related not to the use
        of statistical hypothesis testing per se, but to how the significance outcomes from such tests are interpreted.
        Often it boils down to incorrectly believing a p-value is the probability of a null hypothesis being
        true, when, in fact, it is only the probability of obtaining a test statistic at least as extreme as the one
        calculated from the data if the null hypothesis were true, always within the assumptions of the test in question.
    
        Moreover, another problem may arise when we choose a null hypothesis which is obviously false. There
        is no point in hypothesis testing when the null hypothesis couldn't possibly be true. For instance,
        it is very difficult to believe a parameter of a continuous distribution is exactly equal to
        a hypothesized value, such as zero. Given enough samples, it will always be possible to find a difference,
        however small, and the test will invariably turn significant. Under those specific circumstances,
        statistical testing can be useless as it has no relationship to practical significance. That is why analyzing
        the effect size is important in order to determine the practical significance of a hypothesis test.
        Useful hypotheses also need to be probable, plausible and falsifiable (Beaulieu-Prévost,
            2005).
    
        The following links also summarize much of the criticism in statistical hypothesis testing. The last
        one includes very interesting (if not enlightening) comments (in the comment section) on common criticisms
        of the hypothesis testing method.
            
        Now that we have presented the statistical hypothesis testing framework, and now that the reader is aware
        of its drawbacks, we can start talking about performing those tests with the aid of a computer. The
        Accord.NET Framework is a framework for scientific computing with wide support for statistical and power
        analysis tests, without judging whether they are valid or not. In short, it provides some scissors;
            feel free to run with them.
        
        As it may already have been noticed, the sample code included in the previous section was C# code using
        the framework. In the aforementioned example, we created a T-Test for comparing the population means
        of two samples drawn from Normal distributions. The framework, nevertheless, includes 
            many other tests, some with support for 
                power analysis. Those include:
    
        
| Parametric tests | Nonparametric tests |
|---|---|
        
    
    
        Tests marked with a * are available in versions for one and two samples. Tests on the second row can
        be used to test hypotheses about 
            contingency tables. As a reminder, the framework is 
                open source and all code is available on Google Code.
    
        A class diagram for the hypothesis testing module is shown in the picture below.
    
         
 
        Class diagram for the 
            Accord.Statistics.Testing namespace.
    
        Framework usage should be rather simple. In order to illustrate it, the next section brings some example
        problems and their solution using the hypothesis testing module.
    
        Example problems and solutions 
    
        Problem 1. Clairvoyant card game.
    
        This is the second example from Wikipedia's page on hypothesis testing. In this example, a person
        is tested for clairvoyance (the ability to gain information about something through extrasensory perception;
        detecting something without using the known human senses).
        
    // A person is shown the reverse of a playing card 25 times and is
    // asked which of the four suits the card belongs to. Every time
    // the person correctly guesses the suit of the card, we count this
    // result as a correct answer. Let's suppose the person obtained 13
    // correct answers out of the 25 cards.
    // Since each suit appears 1/4 of the time in the card deck, we 
    // would assume the probability of producing a correct answer by
    // chance alone would be 1/4.
    // And finally, we must consider that we are interested in whether the
    // subject performs better than what would be expected by chance. 
    // In other words, that the person's probability of predicting
    // a card is higher than the chance hypothesized value of 1/4.
    BinomialTest test = new BinomialTest(
        successes: 13, trials: 25,
        hypothesizedProbability: 1.0 / 4.0,
        alternate: OneSampleHypothesis.ValueIsGreaterThanHypothesis);
    Console.WriteLine("Test p-Value: " + test.PValue);     // ~ 0.003
    Console.WriteLine("Significant? " + test.Significant); // True.
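            The p-value above can be cross-checked by hand, since a one-sided exact binomial test simply sums the binomial probabilities of every outcome at least as extreme as the one observed. The sketch below does this in plain C#, without the framework; BinomialSketch and its method names are hypothetical helpers, and this is not necessarily how BinomialTest is implemented internally.

```csharp
using System;

public static class BinomialSketch
{
    // P(X = k) for X ~ Binomial(n, p).
    public static double Pmf(int k, int n, double p)
    {
        // Binomial coefficient C(n, k) through the multiplicative formula.
        double coefficient = 1.0;
        for (int i = 1; i <= k; i++)
            coefficient = coefficient * (n - k + i) / i;

        return coefficient * Math.Pow(p, k) * Math.Pow(1 - p, n - k);
    }

    // One-sided (upper tail) p-value: P(X >= successes).
    public static double UpperTail(int successes, int trials, double p)
    {
        double total = 0.0;
        for (int k = successes; k <= trials; k++)
            total += Pmf(k, trials, p);
        return total;
    }
}
```

            BinomialSketch.UpperTail(13, 25, 0.25) evaluates to roughly 0.0034, in line with the p-value of about 0.003 reported above.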
            
            Problem 2. Worried insurance company
        
            This is a common example with variations given by many sources. Some of them can be found 
                here and here.
                
    // An insurance company is reviewing its current policy rates. When the
    // company initially set its rates, they believed the average claim amount
    // was about $1,800. Now that the company is already up and running, the
    // executive directors want to know whether the mean is greater than $1,800.
    double hypothesizedMean = 1800;
    // Now we have two hypotheses. The null hypothesis (H0) is that there is no
    // difference between the initial set value of $1,800 and the average claim
    // amount of the population. The alternate hypothesis is that the average
    // is greater than $1,800.
    // H0 : population mean ≤ $1,800
    // H1 : population mean > $1,800
    OneSampleHypothesis alternate = OneSampleHypothesis.ValueIsGreaterThanHypothesis;
    // To verify those hypotheses, we draw 40 random claims and obtain a
    // sample mean of $1,950. The standard deviation of claims is assumed
    // to be around $500.
    double sampleMean = 1950;
    double standardDev = 500;
    int sampleSize = 40;
            
    // Let's create our test and check the results
    ZTest test = new ZTest(sampleMean, standardDev,
        sampleSize, hypothesizedMean, alternate);
    Console.WriteLine("Test p-Value: " + test.PValue);      // ~0.03
    Console.WriteLine("Significant? " + test.Significant); // True.
    // In case we would like more information about what was calculated:
    Console.WriteLine("z statistic: " + test.Statistic);     // ~1.89736
    Console.WriteLine("std. error: " + test.StandardError); // 79.05694
    Console.WriteLine("test tail: " + test.Tail); // one Upper (right)
    
    Console.WriteLine("alpha level: " + test.Size); // 0.05
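            The numbers printed above follow from two short formulas: the standard error is the standard deviation divided by the square root of the sample size, and the z statistic is the difference between the sample mean and the hypothesized mean divided by that standard error. The sketch below reproduces them in plain C#. The upper-tail probability uses the Abramowitz and Stegun approximation 7.1.26 to the error function, which is an assumption made here to keep the example self-contained; ZTestSketch is a hypothetical name and this is not the framework's implementation.

```csharp
using System;

public static class ZTestSketch
{
    // Standard error of the mean.
    public static double StandardError(double standardDev, int n)
    {
        return standardDev / Math.Sqrt(n);
    }

    // z statistic for a one-sample z-test.
    public static double Statistic(double sampleMean, double hypothesizedMean,
        double standardDev, int n)
    {
        return (sampleMean - hypothesizedMean) / StandardError(standardDev, n);
    }

    // Upper-tail probability P(Z > z) of the standard normal distribution,
    // using the Abramowitz-Stegun 7.1.26 approximation of erfc.
    public static double UpperTail(double z)
    {
        double x = Math.Abs(z) / Math.Sqrt(2.0);
        double t = 1.0 / (1.0 + 0.3275911 * x);
        double poly = t * (0.254829592 + t * (-0.284496736 + t * (1.421413741
            + t * (-1.453152027 + t * 1.061405429))));
        double tail = 0.5 * poly * Math.Exp(-x * x); // 0.5 * erfc(x), x >= 0
        return z >= 0 ? tail : 1.0 - tail;
    }
}
```

            With the insurance figures, ZTestSketch.StandardError(500, 40) is about 79.057 and ZTestSketch.Statistic(1950, 1800, 500, 40) is about 1.897, matching the values printed by the test object. Passing that statistic to UpperTail gives a one-sided p-value just under 0.03.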
        
            Problem 3. Differences among multiple groups (ANOVA)
        
            This example comes from Wikipedia's page on the F-test. Suppose we would like to study the effect of
            three different levels of a factor on a response (such as, for example, three levels of a fertilizer
            on plant growth). We have made 6 observations for each of the three levels a1, a2 and a3,
            and have written the results as in the table below.
                
    double[][] outcomes = new double[,]
    {
        // a1 a2 a3
        {  6,    8,  13 },
        {  8,   12,   9 },
        {  4,    9,  11 },
        {  5,   11,   8 },
        {  3,    6,   7 },
        {  4,    8,  12 },
    }
    .Transpose().ToArray();
    // Now we can create an ANOVA for the outcomes
    OneWayAnova anova = new OneWayAnova(outcomes);
    // Retrieve the F-test
    FTest test = anova.FTest;
    Console.WriteLine("Test p-value: " + test.PValue);   // ~0.002
    Console.WriteLine("Significant? " + test.Significant); // true
    // Show the ANOVA table
    DataGridBox.Show(anova.Table);
        
            The last line in the example shows the ANOVA table using the framework's DataGridBox object. The DataGridBox
            is a convenience class for displaying DataGridViews just as one would display a message using MessageBox.
            The table is shown below:
        
        
            (Figure: the ANOVA table displayed in a DataGridBox.)
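            For readers who want to see where the F statistic comes from, the sketch below computes it directly from the three groups: the between-group mean square divided by the within-group mean square. It is a plain C# illustration under the usual one-way ANOVA definitions; AnovaSketch is a hypothetical name and the code does not mirror the internals of OneWayAnova.

```csharp
using System;
using System.Linq;

public static class AnovaSketch
{
    // Sum of squared deviations of a group from its own mean.
    private static double SumOfSquares(double[] group)
    {
        double mean = group.Average();
        return group.Sum(v => (v - mean) * (v - mean));
    }

    // One-way ANOVA F statistic for k groups with n observations in total.
    public static double FStatistic(double[][] groups)
    {
        int k = groups.Length;
        int n = groups.Sum(g => g.Length);
        double grandMean = groups.SelectMany(g => g).Average();

        // Variation of the group means around the grand mean.
        double between = groups.Sum(
            g => g.Length * (g.Average() - grandMean) * (g.Average() - grandMean));

        // Variation of the observations around their own group mean.
        double within = groups.Sum(g => SumOfSquares(g));

        return (between / (k - 1)) / (within / (n - k));
    }
}
```

            For the fertilizer data (a1 = 6, 8, 4, 5, 3, 4; a2 = 8, 12, 9, 11, 6, 8; a3 = 13, 9, 11, 8, 7, 12) the group means are 5, 9 and 10, the between-group mean square is 84 / 2 = 42, the within-group mean square is 68 / 15, and the resulting F statistic is about 9.26 with 2 and 15 degrees of freedom.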
             Problem 4. Biased bees
        
            This example comes from the stats page of the College of Saint Benedict and Saint John's University
            (Kirkman, 1996). It is a very interesting example
            as it shows a case in which a t-test fails to see a difference between the samples because of the non-normality
            of the samples' distributions. The Kolmogorov-Smirnov nonparametric test, on the other hand,
            succeeds.
        
            The example deals with the preference of bees between two nearby blooming trees in an empty field. The
            experimenter has collected data measuring how much time a bee spends near a particular tree. The
            timer starts when a bee first touches the tree, and stops when the bee moves more
            than 1 meter away from it. The samples below represent the measured time, in seconds, of the observed
            bees for each of the trees.
                
    double[] redwell = 
    {
        23.4, 30.9, 18.8, 23.0, 21.4, 1, 24.6, 23.8, 24.1, 18.7, 16.3, 20.3,
        14.9, 35.4, 21.6, 21.2, 21.0, 15.0, 15.6, 24.0, 34.6, 40.9, 30.7, 
        24.5, 16.6, 1, 21.7, 1, 23.6, 1, 25.7, 19.3, 46.9, 23.3, 21.8, 33.3, 
        24.9, 24.4, 1, 19.8, 17.2, 21.5, 25.5, 23.3, 18.6, 22.0, 29.8, 33.3,
        1, 21.3, 18.6, 26.8, 19.4, 21.1, 21.2, 20.5, 19.8, 26.3, 39.3, 21.4, 
        22.6, 1, 35.3, 7.0, 19.3, 21.3, 10.1, 20.2, 1, 36.2, 16.7, 21.1, 39.1,
        19.9, 32.1, 23.1, 21.8, 30.4, 19.62, 15.5 
    };
    double[] whitney = 
    {
        16.5, 1, 22.6, 25.3, 23.7, 1, 23.3, 23.9, 16.2, 23.0, 21.6, 10.8, 12.2,
        23.6, 10.1, 24.4, 16.4, 11.7, 17.7, 34.3, 24.3, 18.7, 27.5, 25.8, 22.5,
        14.2, 21.7, 1, 31.2, 13.8, 29.7, 23.1, 26.1, 25.1, 23.4, 21.7, 24.4, 13.2,
        22.1, 26.7, 22.7, 1, 18.2, 28.7, 29.1, 27.4, 22.3, 13.2, 22.5, 25.0, 1,
        6.6, 23.7, 23.5, 17.3, 24.6, 27.8, 29.7, 25.3, 19.9, 18.2, 26.2, 20.4,
        23.3, 26.7, 26.0, 1, 25.1, 33.1, 35.0, 25.3, 23.6, 23.2, 20.2, 24.7, 22.6,
        39.1, 26.5, 22.7
    };
    // Create a t-test as a first attempt.
    var t = new TwoSampleTTest(redwell, whitney);
    Console.WriteLine("T-Test");
    Console.WriteLine("Test p-value: " + t.PValue);    // ~0.837
    Console.WriteLine("Significant? " + t.Significant); // false
    // Create a non-parametric Kolmogorov-Smirnov test
    var ks = new TwoSampleKolmogorovSmirnovTest(redwell, whitney);
    Console.WriteLine("KS-Test");
    Console.WriteLine("Test p-value: " + ks.PValue);    // ~0.038
    Console.WriteLine("Significant? " + ks.Significant); // true
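            The reason the Kolmogorov-Smirnov test can succeed where the t-test fails is that it compares the whole empirical distribution of each sample rather than just the means. Its statistic is the largest vertical distance between the two empirical cumulative distribution functions. The sketch below computes that distance in plain C#; KolmogorovSmirnovSketch is a hypothetical helper written for illustration, it computes only the statistic (not the p-value), and it is not the framework's implementation.

```csharp
using System;

public static class KolmogorovSmirnovSketch
{
    // Two-sample KS statistic: the maximum absolute difference between
    // the empirical cumulative distribution functions of x and y.
    public static double Statistic(double[] x, double[] y)
    {
        double[] a = (double[])x.Clone();
        double[] b = (double[])y.Clone();
        Array.Sort(a);
        Array.Sort(b);

        int i = 0, j = 0;
        double d = 0.0;

        while (i < a.Length && j < b.Length)
        {
            // Next distinct value at which either ECDF jumps.
            double v = Math.Min(a[i], b[j]);

            // Advance past every observation <= v (this handles ties).
            while (i < a.Length && a[i] <= v) i++;
            while (j < b.Length && b[j] <= v) j++;

            double diff = Math.Abs((double)i / a.Length - (double)j / b.Length);
            if (diff > d) d = diff;
        }

        return d;
    }
}
```

            As a small sanity check, two completely separated samples such as { 1, 2, 3 } and { 4, 5, 6 } give a statistic of 1, while two identical samples give 0. In the bee data, the many observations recorded as 1 second pull both means toward each other, but the shapes of the two distributions still differ, and that is what a distance of this kind picks up.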
        
            Problem 5. Comparing classifier performances
        
            The last example comes from (E.
                Ientilucci, 2006) and deals with comparing the performance of two different raters (classifiers)
            to see if their performances are significantly different.
        
             Suppose an experimenter has two classification systems, both trained to classify observations
            into one of 4 mutually exclusive categories. In order to measure the performance of each classifier,
            the experimenter confronted their classification labels with the ground truth for a testing dataset,
            writing the respective results in the form of contingency tables.
        
            The hypothesis to be tested is that the performance of the two classifiers is the same.
                             
    // Create the confusion matrix for the first system.
    var a = new GeneralConfusionMatrix(new int[,]
    {
        { 317,  23,  0,  0 },
        {  61, 120,  0,  0 },
        {   2,   4, 60,  0 },
        {  35,  29,  0,  8 },
    });
    // Create the confusion matrix for the second system.
    var b = new GeneralConfusionMatrix(new int[,]
    {
        { 377,  79,  0,  0 },
        {   2,  72,  0,  0 },
        {  33,   5, 60,  0 },
        {   3,  20,  0,  8 },
    });
    var test = new TwoMatrixKappaTest(a, b);
    Console.WriteLine("Test p-value: " + test.PValue);    // ~0.628
    Console.WriteLine("Significant? " + test.Significant); // false
    
        
            In this case, 
            the test didn't show enough evidence to confidently reject the null hypothesis. Therefore, one 
            should refrain from affirming anything about differences between the two systems, unless the power 
            of the test is known.
        
            Unfortunately I could not find a clear indication in the literature about the power of a two matrix Kappa test. However,
            since the test statistic is asymptotically normal, one could try checking the power of this test by
            analyzing the power of the underlying Z-test. If there is enough power, one could possibly accept the
            null hypothesis that there are no large differences between the two systems.
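            For reference, the quantity being compared between the two systems is Cohen's kappa, which measures how much the agreement on the diagonal of a confusion matrix exceeds the agreement expected by chance. The sketch below computes the point estimate only, in plain C#, using the standard definition (observed agreement minus chance agreement, divided by one minus chance agreement). KappaSketch is a hypothetical helper; the variance estimate needed by the actual two matrix test is not reproduced here.

```csharp
using System;

public static class KappaSketch
{
    // Cohen's kappa for a square confusion matrix.
    public static double Kappa(int[,] matrix)
    {
        int size = matrix.GetLength(0);
        double total = 0, diagonal = 0;
        double[] rows = new double[size];
        double[] cols = new double[size];

        for (int i = 0; i < size; i++)
        {
            for (int j = 0; j < size; j++)
            {
                total += matrix[i, j];
                rows[i] += matrix[i, j];
                cols[j] += matrix[i, j];
                if (i == j) diagonal += matrix[i, j];
            }
        }

        // Proportion of observations on the diagonal.
        double observed = diagonal / total;

        // Agreement expected by chance from the row and column marginals.
        double expected = 0;
        for (int i = 0; i < size; i++)
            expected += rows[i] * cols[i] / (total * total);

        return (observed - expected) / (1 - expected);
    }
}
```

            Applied to the first confusion matrix above, the observed agreement is 505 / 659 (about 0.766), the chance agreement is about 0.409, and kappa comes out at roughly 0.60.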
                
            As always, I hope the above discussion and examples will be useful for interested readers and 
            users. However, if you believe you have found a flaw or would like to discuss any portion of this 
            post, please feel free to do so by posting in the comments section.
 
            PS: The classroom example uses a T-test to test for differences in population means. The T-Test assumes a normal distribution. The data, however, is not exactly normal, since it is bounded between 0 and 10. Suggestions for a better example would also be appreciated!
                
            R. A. Fisher, 1925. Statistical Methods for Research Workers. Available online from: 
                http://psychclassics.yorku.ca/Fisher/Methods/ 
        
        
            M. K. Smith, 2011. Common mistakes in using statistics: Spotting and Avoiding Them - Power of a
            Statistical Procedure. Available online from: 
                http://www.ma.utexas.edu/users/mks/statmistakes/power.html
        
            M. K. Smith, 2011b. Common mistakes in using statistics: Spotting and Avoiding Them - Detrimental
            Effects of Underpowered or Overpowered Studies. Available online from: 
                http://www.ma.utexas.edu/users/mks/statmistakes/UnderOverPower.html
        
            L. Thomas, F. Juanes, 1996. The importance of statistical power analysis: an example from animal behaviour,
            Animal Behaviour, Volume 52, Issue 4, October, pages 856-859. Available online from: http://otg.downstate.edu/downloads/2007/Spring07/thomas.pdf
        
        
            C. H. (Alex) Yu, 2012. Don't believe in the null hypothesis? Available online from: 
                http://www.creative-wisdom.com/computer/sas/hypothesis.html
        
        
            D. Beaulieu-Prévost, 2005. Statistical decision and falsification in science : going beyond the null
            hypothesis. Séminaires CIC. Université de Montréal.
        
            T.W. Kirkman, 1996. Statistics to Use. Accessed July 2012. Available online from: 
                http://www.physics.csbsju.edu/stats/
        
            E. Ientilucci, 2006. "On Using and Computing the Kappa Statistic". Available online from http://www.cis.rit.edu/~ejipci/Reports/On_Using_and_Computing_the_Kappa_Statistic.pdf