Criticisms of Significance Testing

Kirk (1996), cited here: http://www.ejournal.aiaer.net/vol21109/4.%20Latchanna%20&%20Masoomeh.pdf

  • Statistical significance testing is a vital part of statistical analysis.
  • But now is the time to incorporate practical significance as well.
  • Three points need to be adequately met for a result to be considered in good empirical standing:
        o Statistical significance:
               • Relies on sampling variation.
               • Shows that the observed difference between means is most likely not due to chance.
               • BUT it does NOT show how much more effective one treatment is than the other.
               • This has been forgotten in favour of simply obtaining a significant result, which has contributed to the ‘file drawer effect’.
        o Practical significance:
               • A result may be statistically significant but not necessarily practically significant.
        o Reliability criticisms:
               • There is a problem arising from the relationship between the degrees of freedom of the group and statistical significance.
               • In some circumstances a significant result can come from a large sample even if it only highlights a SMALL relationship between two variables.
               • In such cases the reported significant result may hold no practical value, as the effect of the relationship between the two variables is too minuscule (see the first sketch below the list).
  • The null hypothesis is not correctly understood and is therefore inadequately interpreted (Kirk, 1996).
  • THE ALPHA LEVEL:
        o The alpha level is set to control the probability of committing Type I and Type II errors (Olejnik, 1984: http://0-pao.chadwyck.co.uk.unicat.bangor.ac.uk/articles/displayItem.do?QueryType=articles&QueryIndex=journal&ResultsID=135B2172EF21097761&ItemNumber=5&BackTo=journalid&BackToParam=QueryType=journals|ItemID=h101|issue=53:1%20(1984:Fall)&journalID=h101).
        o Kirk (1996) and Young (1993) (cited here: http://www.ejournal.aiaer.net/vol21109/4.%20Latchanna%20&%20Masoomeh.pdf) point out that simply labelling results as significant or non-significant is inadequate, because probability is a continuous measure. For example, if we labelled those who are 5 ft tall as ‘normal height’ and regarded anyone shorter as ‘short’, then even an individual half an inch below 5 ft would not classify as ‘normal height’.
               • A more relevant example in research would be obtaining a p value of .051 and regarding it as non-significant, whereas a p value of .049 counts as significant. The difference between these two values is a mere .002. Should such a small difference lead to such contrasting classifications? This too contributes to the ‘file drawer problem’.
        o MISUSE OF RESULTS:
               • Some associate significance testing with reliability or replicability (Vacha-Haase & Nilsson, 1998, cited here: http://www.ejournal.aiaer.net/vol21109/4.%20Latchanna%20&%20Masoomeh.pdf).
               • This leads to the assumption that a p value of 0.001 is more important than a p value of 0.05. But p values do not measure the chance that the result obtained will be reproduced when the experiment is replicated (Cohen, 1994: http://www.ics.uci.edu/~sternh/courses/210/cohen94_pval.pdf); the second sketch below the list illustrates this.
               • Using the term ‘highly significant’, or citing a small p value as evidence of a strong effect, is very misleading: a smaller p value does not indicate the strength of the observed relationship or difference (Friedman, 1968; Vacha-Haase & Thompson, 1998, cited here: http://www.ejournal.aiaer.net/vol21109/4.%20Latchanna%20&%20Masoomeh.pdf).
               • Linked to this incorrect assumption is the notion that a non-significant result indicates there was no effect present at all, when in actual fact it means there was not enough evidence to reject the null hypothesis at the alpha level employed. A non-significant result therefore does not mean that the null hypothesis was true (Schmidt & Hunter, 1995, cited in http://www.ejournal.aiaer.net/vol21109/4.%20Latchanna%20&%20Masoomeh.pdf).
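
A minimal sketch of the large-sample point above (Python with numpy/scipy; the means, SD, and sample size are invented for illustration): with enough participants, a trivially small group difference comes out “statistically significant”, yet the effect size shows it has almost no practical value.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(42)
    n = 100_000                                    # very large sample per group
    a = rng.normal(loc=100.0, scale=15.0, size=n)  # control group scores
    b = rng.normal(loc=100.3, scale=15.0, size=n)  # treatment: tiny true shift

    t, p = stats.ttest_ind(a, b)
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    d = (b.mean() - a.mean()) / pooled_sd          # Cohen's d (effect size)

    print(f"p = {p:.2e}")   # far below .05, so "significant"
    print(f"d = {d:.3f}")   # around 0.02: practically negligible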
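
A second sketch, for the replication point under ‘misuse of results’: suppose an original study with 30 participants per group just reached p ≈ .05, and suppose, generously, that the true effect equals the observed one (d ≈ 0.52 puts a two-sample t-test right at the .05 boundary for that n). Simulated direct replications then reach significance only about half the time, so the original p value said little about replicability. All numbers here are illustrative assumptions, not figures from Kirk or Cohen.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n, d_true, reps = 30, 0.52, 10_000   # per-group n, assumed true effect

    significant = 0
    for _ in range(reps):
        a = rng.normal(0.0, 1.0, n)      # control group
        b = rng.normal(d_true, 1.0, n)   # treatment group
        if stats.ttest_ind(a, b).pvalue < 0.05:
            significant += 1

    print(f"replications reaching p < .05: {significant / reps:.0%}")  # ~50%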

5 thoughts on “Criticisms of Significance Testing”

  1. Jessicaaro says:

    Significance does not necessarily mean that an experiment’s results are trustworthy, and this can be due to Type I and Type II errors. A Type I error is where a null hypothesis is rejected incorrectly; in other words, a researcher claims their result is significant when it isn’t. A Type II error is when a researcher fails to reject a false null hypothesis. These errors are mistakes in a significance test that are usually caused by human mistakes in the design, methodology, and running of the statistical tests on the results. The more significance tests you run in an experiment, the greater the risk of a false positive: if you have an experiment with many conditions and run lots of t-tests between every pair of variables, there will be a higher chance of getting at least one significant result purely by chance (there is a small sketch of this below). This is how a Type I error arises, and the experimenter may reject a true null hypothesis. Recently, research has been more concerned about the chance of a Type II error, and when put in a certain context, such as looking for cures for cancer, you can see why. Imagine if a new treatment that could be the answer to curing cancer were discarded because it was not tested properly (in essence, accepting a false null hypothesis). This treatment could have saved many lives; to miss it because of inappropriate testing conditions would be a terrible blunder.
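
    A quick sketch of that multiple-testing point (plain Python; the alpha level and test counts are just illustrative): if every null hypothesis is actually true and you run k independent tests at alpha = .05, the chance of at least one false positive is 1 - 0.95^k, which grows quickly.

        # Family-wise Type I error rate for k independent tests at alpha = .05,
        # assuming every null hypothesis is true.
        for k in (1, 5, 10, 20):
            familywise = 1 - 0.95 ** k
            print(f"{k:2d} tests -> P(at least one Type I error) = {familywise:.2f}")
        # prints roughly: 0.05, 0.23, 0.40, 0.64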

    References
    http://www.graphpad.com/library/BiostatsSpecial/article_152.htm
    http://experimentaltheology.blogspot.co.uk/2010/09/theology-of-type-1-type-2-errors.html
    http://www.nontoxic.org.uk/?p=205
    http://ceaccp.oxfordjournals.org/content/7/6/208.full

  2. camdowning says:

    You mention that significance testing has little practical use, as it doesn’t tell us the size of the effect nor whether it will be truly effective in a practical situation. However, effect sizes supply us with exactly that information. Effect sizes should be reported following formal statistical significance tests (Baguley, 2009). They allow researchers to give a quantitative, interpretable description of the size of an effect (Fritz, Morris & Richler, 2012), or, in practical terms, how much a specific treatment has worked (a small sketch of this is below). For example, Titov et al. (2011) reported effect sizes for internet-based treatment of anxiety and depression to demonstrate that the treatment had positive effects on between 50-60% of participants, conveying the practical use of such treatments. To reduce the file drawer problem you mentioned, effect sizes can also be used in meta-analysis rather than significance levels (Rosenthal & Rubin, 1986).
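
    A minimal sketch of reporting an effect size next to the test, as suggested above (the data are invented): Cohen’s d is the mean difference divided by the pooled standard deviation, and it can be converted to a ‘probability of superiority’, which reads a little like Titov et al.’s percentage figures.

        import numpy as np
        from scipy import stats

        control = np.array([4.1, 5.0, 4.6, 5.2, 4.8, 4.3, 5.1, 4.7])
        treated = np.array([5.0, 5.7, 4.5, 5.9, 5.2, 4.7, 6.0, 5.4])

        t, p = stats.ttest_ind(treated, control)
        pooled_sd = np.sqrt((treated.var(ddof=1) + control.var(ddof=1)) / 2)
        d = (treated.mean() - control.mean()) / pooled_sd   # Cohen's d

        # Common-language effect size: P(a random treated score beats a control one)
        cles = stats.norm.cdf(d / np.sqrt(2))

        print(f"t = {t:.2f}, p = {p:.4f}, d = {d:.2f}, P(superiority) = {cles:.0%}")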

    Baguley, T. (2009). Standardized or simple effect size: what should be reported? British Journal of Psychology, 100(3), 603-617. doi: 10.1348/000712608X377117

    Fritz, C.O., Morris, P.E., & Richler, J.J. (2012). Effect size estimates: current use, calculations, and interpretation. Journal of Experimental Psychology: General, 141(1), 2-18. doi: 10.1037/a0024338

    Rosenthal, R., & Rubin, D.B. (1986). Meta-analytic procedures for combining studies with multiple effect sizes. Psychological Bulletin, 99(3), 400-406. doi: 10.1037/0033-2909.99.3.400

    Titov, N., Dear, B.F., Schwencke, G., Andrews, G., Johnston, L., Craske, M.G., & McEvoy, P. (2011). Transdiagnostic internet treatment for anxiety and depression: a randomised controlled trial. Behaviour Research and Therapy, 49(8), 441-452. doi: 10.1016/j.brat.2011.03.007

  3. psucac says:

    Well, what statistical significance really measures is the likelihood of a result occurring by chance. I remember how in first-year psychology we would get really excited when we got a significant result from our tests. I guess we didn’t really know what significance was. All that a statistically significant result means is that you can be fairly confident the difference or relationship you found is real. However, you cannot assume that the finding is an important one, or that you can make some kind of decision relying on it. Sometimes you only get a significant result because you have a large sample size, so very small differences are detected as significant: you can trust that the difference is real, but you can’t assume that it is large or important (see the sketch below). So I think that significance is a useful tool as long as we are careful with it. Getting a significant result means that the difference or relationship exists, but it is only half of the story.
    Reference: http://www.statpac.com/surveys/statistical-significance.htm
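
    A small sketch of that sample-size point (means and SD below are invented): hold the raw difference and the spread fixed and only grow the sample, and the p value typically shrinks toward significance even though the effect itself never changes.

        import numpy as np
        from scipy import stats

        rng = np.random.default_rng(7)
        for n in (20, 200, 2000, 20000):
            a = rng.normal(50.0, 10.0, n)    # group 1
            b = rng.normal(51.0, 10.0, n)    # group 2: same small true difference
            p = stats.ttest_ind(a, b).pvalue
            print(f"n = {n:>5} per group -> p = {p:.4f}")
        # the true effect (d = 0.1) is identical in every run; only n changes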

