with H. Chernoff, T. Zheng and S. H. Lo.
We propose approaching prediction from a framework grounded in the theoretical correct prediction rate of a variable set (“predictivity”), treating that rate as a parameter of interest. This allows us to define a novel measure of predictivity based on observed data, which in turn enables assessing variable sets for high predictivity. Although the idea is intuitively obvious, too little attention has been paid to predictivity as an estimable parameter of interest. Motivated by the needs of current genome-wide association studies (GWAS), we provide such a discussion. We first describe the correct prediction rate, the parameter of interest, for a variable set. We then consider, and ultimately reject, an estimator that approximates the correct prediction rate of a variable set from sample data, because the estimator cannot distinguish between noisy and predictive variables and therefore cannot estimate the rate without inflated bias. In response, we offer an alternative parameter that describes the predictivity of a variable set: a lower bound on the correct prediction rate. We demonstrate that the Partition Retention method’s I-score can be used to compute a measure that asymptotically approaches this lower bound. The I-score can also effectively differentiate between noisy and predictive variables, making it helpful in variable selection. We offer simulations
and an application of the I-score to real data to demonstrate the statistic’s predictive performance on sample data. These show that the I-score can capture highly predictive variable sets, estimates a lower bound for the theoretical correct prediction rate, and correlates well with the out-of-sample correct rate. We conjecture that using Partition Retention and the I-score can aid in finding variable sets with promising prediction rates; however, further research on sample-based measures of predictivity is much desired.