Difference between revisions of "Data dredging"

From Wiki @ Karl Jones dot com
(Created page with "'''Data dredging''' (also '''data fishing''', '''data snooping''', and p-hacking) is the use of data mining to uncover patterns in data that can be presented as statistica...")
 
 
(4 intermediate revisions by the same user not shown)
Line 1: Line 1:
'''Data dredging''' (also '''data fishing''', '''data snooping''', and p-hacking) is the use of [[data mining]] to uncover patterns in data that can be presented as statistically significant, without first devising a specific hypothesis as to the underlying causality.
+
'''Data dredging''' (also '''data fishing''', '''data snooping''', and p-hacking) is the use of [[data mining]] to uncover patterns in data that can be presented as [[Statistical significance|statistically significant]], without first devising a specific hypothesis as to the underlying causality.
  
 
== Description ==
 
== Description ==
Line 9: Line 9:
 
== See also ==
 
== See also ==
  
 +
* [[Base rate fallacy]]
 +
* [[Bonferroni inequalities]]
 +
* [[Cherry picking]]
 
* [[Data mining]]
 
* [[Data mining]]
 +
* [[Lincoln–Kennedy coincidences urban legend]]
 +
* [[Look-elsewhere effect]]
 +
* [[Misuse of statistics]]
 +
* [[Multiple comparisons problem]]
 +
* [[Overfitting]]
 +
* [[Pareidolia]]
 +
* [[Post hoc analysis]]
 +
* [[Predictive analytics]]
 +
* [[Statistical significance]]
  
 
== External links ==
 
== External links ==
  
 
* [https://en.wikipedia.org/wiki/Data_dredging Data dredging] @ Wikipedia
 
* [https://en.wikipedia.org/wiki/Data_dredging Data dredging] @ Wikipedia
 +
 +
[[Category:Computer science]]
 +
[[Category:Data]]
 +
[[Category:Information]]
 +
[[Category:Metadata]]
 +
[[Category:Privacy]]

Latest revision as of 12:49, 14 November 2016

Data dredging (also data fishing, data snooping, and p-hacking) is the use of data mining to uncover patterns in data that can be presented as statistically significant, without first devising a specific hypothesis as to the underlying causality.

Description

The process of data mining involves automatically testing huge numbers of hypotheses about a single data set by exhaustively searching for combinations of variables that might show a correlation. Conventional tests of statistical significance are based on the probability that an observation arose by chance, and necessarily accept some risk of a mistaken result; that accepted risk is called the significance level. When large numbers of tests are performed, some produce false positives: by chance alone, about 5% of randomly chosen hypotheses with no real effect behind them turn out to be significant at the 5% level, about 1% at the 1% level, and so on. When enough hypotheses are tested, it is virtually certain that some will falsely appear statistically significant, since almost every data set with any degree of randomness is likely to contain some spurious correlations. Researchers using data mining techniques can easily be misled by these results if they are not cautious.
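The effect described above can be illustrated with a small simulation. The following is a minimal sketch, not a reference implementation: it assumes each "hypothesis" is a two-sided z-test of whether a sample of pure Gaussian noise (known unit variance) has mean zero, so every significant result is a false positive by construction. The function names, sample sizes, and seed are illustrative choices, and the closed-form family-wise error rate assumes the tests are independent.

```python
import math
import random


def two_sided_p_value(sample):
    """Two-sided z-test p-value for H0: mean == 0, assuming known unit variance."""
    n = len(sample)
    z = sum(sample) / math.sqrt(n)  # sample mean divided by its standard error 1/sqrt(n)
    # 2 * (1 - Phi(|z|)) equals erfc(|z| / sqrt(2)) for the standard normal CDF Phi.
    return math.erfc(abs(z) / math.sqrt(2))


def count_false_positives(num_tests=1000, sample_size=30, alpha=0.05, seed=0):
    """Run num_tests tests on pure noise and count how many look significant."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(num_tests):
        # No real effect exists: every sample is drawn from N(0, 1).
        sample = [rng.gauss(0.0, 1.0) for _ in range(sample_size)]
        if two_sided_p_value(sample) < alpha:
            hits += 1
    return hits


def familywise_error_rate(num_tests, alpha=0.05):
    """Probability of at least one false positive among independent tests."""
    return 1.0 - (1.0 - alpha) ** num_tests


if __name__ == "__main__":
    hits = count_false_positives()
    print(f"{hits} of 1000 noise-only tests were 'significant' at the 5% level")
    print(f"P(at least one false positive in 100 tests) = {familywise_error_rate(100):.3f}")
```

Under these assumptions, roughly 5% of the noise-only tests should come out significant, and with 100 independent tests at the 5% level the chance of at least one false positive is about 0.994.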

The multiple comparisons hazard is common in data dredging. Moreover, subgroups are sometimes explored without telling the reader how many questions were examined, which can lead to misinformed conclusions.
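One common way to account for the number of questions asked is a Bonferroni-style correction, which the article itself does not describe; it is shown here only as an illustrative sketch. The function name and the example p-values are hypothetical. The idea is to compare each p-value against the significance level divided by the number of tests, which keeps the chance of any false positive at or below that level.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Flag which p-values remain significant after a Bonferroni correction."""
    m = len(p_values)
    if m == 0:
        return []
    # Each test must clear the stricter per-test threshold alpha / m.
    threshold = alpha / m
    return [p < threshold for p in p_values]


if __name__ == "__main__":
    # Hypothetical p-values from four subgroup analyses.
    p_values = [0.001, 0.02, 0.04, 0.3]
    print(bonferroni_significant(p_values))
```

With four tests the per-test threshold is 0.0125, so three of these hypothetical p-values fall below the uncorrected 0.05 level but only the first survives the correction. This is also why reporting the number of comparisons matters: the threshold cannot be computed without it.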

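See also

* Base rate fallacy
* Bonferroni inequalities
* Cherry picking
* Data mining
* Lincoln–Kennedy coincidences urban legend
* Look-elsewhere effect
* Misuse of statistics
* Multiple comparisons problem
* Overfitting
* Pareidolia
* Post hoc analysis
* Predictive analytics
* Statistical significance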

External links

* Data dredging @ Wikipedia: https://en.wikipedia.org/wiki/Data_dredging