On automatic kernel density estimate-based tests for goodness-of-fit

Although estimation and testing are different statistical problems, if we want to use a test statistic based on the Parzen–Rosenblatt estimator to test the hypothesis that the underlying density function f is a member of a location-scale family of probability density functions, it may be found reasonable to choose the smoothing parameter in such a way that the kernel density estimator is an effective estimator of f irrespective of which of the null or the alternative hypothesis is true. In this paper we address this question by considering the well-known Bickel–Rosenblatt test statistics which are based on the quadratic distance between the nonparametric kernel estimator and two parametric estimators of f under the null hypothesis. For each one of these test statistics we describe their asymptotic behaviours for a general data-dependent smoothing parameter, and we state their limiting Gaussian null distribution and the consistency of the associated goodness-of-fit test procedures for location-scale families. In order to compare the finite sample power performance of the Bickel–Rosenblatt tests based on a null hypothesis-based bandwidth selector with other bandwidth selector methods existing in the literature, a simulation study for the normal, logistic and Gumbel null location-scale models is included in this work.


Introduction
Given $X_1, \dots, X_n$ independent and identically distributed real-valued random variables from an absolutely continuous distribution with continuous density function $f$, it is well known that the unknown density function $f$ may be estimated by using the Parzen-Rosenblatt estimator (Rosenblatt 1956; Parzen 1962) defined, for $x \in \mathbb{R}$, by

$$f_h(x) = \frac{1}{n} \sum_{i=1}^{n} K_h(x - X_i), \qquad (1)$$

where $K_h(\cdot) = K(\cdot/h)/h$, for $h > 0$, with $K$ a kernel in $\mathbb{R}$, that is, $K$ is a bounded and symmetric probability density function, and the bandwidth $h = h_n$ is a sequence of strictly positive real numbers converging to zero as $n$ tends to infinity, which we always assume throughout this paper (see Devroye and Györfi 1985; Silverman 1986; Bosq and Lecoutre 1987; Wand and Jones 1995; Simonoff 1996, and Tsybakov 2009, for general reviews on density estimation).
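As a concrete illustration of the estimator just defined, the following minimal Python sketch evaluates $f_h(x)$ at a point for a Gaussian kernel. The function names `gaussian_kernel` and `kde` are illustrative choices, not taken from the paper, and the bandwidth is simply passed in by the caller.

```python
import math


def gaussian_kernel(u):
    """Standard Gaussian density, a bounded and symmetric kernel."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def kde(x, sample, h, kernel=gaussian_kernel):
    """Parzen-Rosenblatt estimate f_h(x) = n^{-1} sum_i K_h(x - X_i),
    with K_h(u) = K(u / h) / h."""
    if h <= 0:
        raise ValueError("the bandwidth h must be strictly positive")
    n = len(sample)
    return sum(kernel((x - xi) / h) for xi in sample) / (n * h)
```

With a single observation at zero and $h = 1$, `kde(0.0, [0.0], 1.0)` reduces to the kernel evaluated at zero.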
Other than the estimation of the underlying probability density function, the kernel density estimator can also be used for testing the null hypothesis

$$H_0 : f \in \mathcal{F}_0,$$

where $\mathcal{F}_0$ is a parametric family of density functions, against a general alternative hypothesis. This idea was first explored in Bickel and Rosenblatt (1973) who considered, among others, two test statistics based on the $L_2$ distance between the nonparametric estimator $f_h$ and two parametric estimators of $f$ under the null hypothesis. Focusing our attention on the case where $\mathcal{F}_0$ is a location-scale family, that is,

$$\mathcal{F}_0 = \{ g(\cdot; \theta_1, \theta_2) : \theta_1 \in \mathbb{R},\ \theta_2 > 0 \}, \qquad (2)$$

with $g(x; \theta_1, \theta_2) = f_0((x - \theta_1)/\theta_2)/\theta_2$, and $f_0$ is a known probability density function on $\mathbb{R}$, the Bickel-Rosenblatt test statistics we are interested in are given by

$$I_n(h) = I_n(X_1, \dots, X_n; h) = n h \int \{ f_h(x) - K_h * g(\cdot; \hat\theta_1, \hat\theta_2)(x) \}^2 \, dx \qquad (3)$$

and

$$J_n(h) = J_n(X_1, \dots, X_n; h) = n h \int \{ f_h(x) - g(x; \hat\theta_1, \hat\theta_2) \}^2 \, dx, \qquad (4)$$

where the integrals are over $\mathbb{R}$ with respect to the Lebesgue measure, $\hat\theta_k$, for $k = 1, 2$, are consistent estimators of $\theta_k$ under $H_0$, and $*$ denotes the convolution operator. The theoretical properties of goodness-of-fit tests based on $I_n(h)$ and $J_n(h)$ were first studied by Bickel and Rosenblatt (1973) in the univariate case by using strong approximation techniques for empirical processes, and by Rosenblatt (1975) in the multivariate case, by using a Poissonization of sample size technique. However, a full description of their asymptotic behaviour was later provided in Fan (1994) by using the fact, first noticed in Hall (1984), that central limit theorems for the integrated squared error of kernel density estimators can be derived by using a central limit theorem for degenerate U-statistics with variable kernels (see Ghosh and Huang 1991; Fan 1998; Gouriéroux and Tenreiro 2001; Cao and Lugosi 2005, for other works on goodness-of-fit tests based on the kernel density estimator).

Taking into account that the class $\mathcal{F}_0$ is closed with respect to affine transformations, some authors argue that any reasonable statistic $T_n = T_n(X_1, \dots, X_n)$ for testing $H_0$ should be location-scale invariant, that is, it should satisfy the equality $T_n(a + bX_1, \dots, a + bX_n) = T_n(X_1, \dots, X_n)$, for each $a \in \mathbb{R}$ and $b > 0$ (see Henze 2002, p. 469; Ebner and Henze 2020, p. 847). As we can easily see, this invariance property does not hold for the functionals $I_n(h)$ and $J_n(h)$ whenever we take for $h$ a deterministic bandwidth, even when $\hat\theta_1$ is location-scale equivariant and $\hat\theta_2$ is scale equivariant, that is, $\hat\theta_1(a + bX_1, \dots, a + bX_n) = a + b\,\hat\theta_1(X_1, \dots, X_n)$ and $\hat\theta_2(a + bX_1, \dots, a + bX_n) = b\,\hat\theta_2(X_1, \dots, X_n)$, for each $a \in \mathbb{R}$ and $b > 0$. However, if we further assume that $h = \hat h(X_1, \dots, X_n)$ depends on the observations and is scale equivariant, then $I_n(\hat h)$ and $J_n(\hat h)$ are location-scale invariant test statistics. This invariance property follows easily from the representations

$$I_n(\hat h) = n (\hat h/\hat\theta_2) \int \{ \tilde f_{\hat h/\hat\theta_2}(y) - K_{\hat h/\hat\theta_2} * f_0(y) \}^2 \, dy \qquad (5)$$

and

$$J_n(\hat h) = n (\hat h/\hat\theta_2) \int \{ \tilde f_{\hat h/\hat\theta_2}(y) - f_0(y) \}^2 \, dy, \qquad (6)$$

where $\tilde f_h$ is the kernel estimator with kernel $K$ and smoothing parameter $h$, based on the so-called scaled residuals $Y_{n,j} = (X_j - \hat\theta_1)/\hat\theta_2$, $j = 1, \dots, n$.

When $\hat h$ takes the form $\hat h = \hat\theta_2 h$ with $h$ a deterministic bandwidth, the statistic $I_n(\hat h)$ is considered in Bowman (1992) (see also Fan 1994, pp. 332-336) and the theoretical properties of the goodness-of-fit test based on $I_n(\hat h)$ are described in Tenreiro (2007) in the case where $\theta_1$ and $\theta_2$ are, respectively, the mean and the standard deviation of $g(\cdot; \theta_1, \theta_2)$, and $\hat\theta_1 = \bar X_n$ and $\hat\theta_2 = S_n$, where $\bar X_n = n^{-1} \sum_{i=1}^{n} X_i$ is the sample mean and $S_n^2 = n^{-1} \sum_{i=1}^{n} (X_i - \bar X_n)^2$ is the sample variance. Moreover, Bowman (1992, p. 3) also suggests taking for the deterministic bandwidth $h$ the asymptotically optimal bandwidth, in the sense of the mean integrated square error, for estimating the null density $f_0$. In this case, we have $h = h_1(f_0; K, n)$, where

$$h_1(f; K, n) = c_K \, R(f'')^{-1/5} \, n^{-1/5}, \qquad (7)$$

with

$$c_K = \left( R(K) / \mu_2(K)^2 \right)^{1/5} \qquad (8)$$

(see Bosq and Lecoutre 1987, pp. 78-83 and Wand and Jones 1995, pp. 19-23), where $R(\varphi) = \int \varphi(x)^2 \, dx$ and $\mu_2(\varphi) = \int x^2 \varphi(x) \, dx$, for an arbitrary real-valued measurable function $\varphi$, which leads us to consider for $\hat h$ the null hypothesis-based bandwidth selector

$$\hat h_{H_0} = \hat\theta_2 \, c_K \, R(f_0'')^{-1/5} \, n^{-1/5}. \qquad (9)$$

In the case of testing a hypothesis of normality, that is, $f_0 = \phi$, where $\phi$ is the standard Gaussian density, and taking $K = \phi$ and $\hat\theta_2 = S_n$, this leads to the data-dependent bandwidth

$$\hat h_{H_0} = (4/3)^{1/5} \, S_n \, n^{-1/5}. \qquad (10)$$

This approach, also considered in Bowman and Foster (1993, p. 535) for testing a multivariate hypothesis of normality, was first suggested with apparent good results by Henze and Zirkler (1990, p. 3600; see also Ebner and Henze 2020) and the corresponding theoretical properties of the test statistic $I_n(\hat h)$ were first established in Gürtler (2000).

From an estimation perspective, the choice of the bandwidth is crucial to the performance of the kernel density estimator, this being one of the most studied topics in kernel density estimation, and several data-based approaches have been proposed for selecting $h$ (see Wand and Jones 1995, pp. 58-89, and also Tenreiro 2017, p. 3440, where more recent bandwidth selection methods are mentioned). Although estimation and testing are different statistical problems, if we want to test $H_0$ through a test statistic based on the kernel density estimator, it may be found reasonable to select the smoothing parameter in such a way that $f_h$ is an effective estimator of the underlying density $f$ irrespective of which of the null or the alternative hypothesis is true, a property that is not fulfilled by automatic bandwidth selector (9). Although some scepticism has been expressed about this approach by Bowman (1992, p. 3), mainly due to the extra source of variation introduced into the null distribution of the test statistic by the considered bandwidth selector, in this paper we intend to address this issue in depth by considering the situation where the data-dependent smoothing parameter $\hat h$ satisfies the relative consistency condition

$$\hat h / h_0 \xrightarrow{\ p\ } 1, \qquad (11)$$

where $h_0 = h_0(f; K, n)$ is the exact optimal bandwidth in the sense that it minimizes the kernel density estimator mean integrated square error, that is,

$$h_0 = \arg\min_{h > 0} \mathrm{MISE}(f; K, n, h), \qquad (12)$$

where

$$\mathrm{MISE}(f; K, n, h) = E \int \{ f_h(x) - f(x) \}^2 \, dx.$$

For a square integrable density $f$, the existence of this exact optimal bandwidth for all $n \in \mathbb{N}$ can be established whenever the kernel $K$ is continuous at zero with $R(K) < 2K(0)$ (see Chacón et al. 2007). Classical data-based bandwidth selectors such as the least squares cross-validation bandwidth or the two-stage direct plug-in bandwidth selector based on $h_1 = h_1(f; K, n)$, which are both described in Wand and Jones (1995, pp. 63-65, 71-72), are scale equivariant and satisfy (11).

The remainder of this work is organised as follows. In Sect. 2 we describe the asymptotic behaviour of the Bickel-Rosenblatt test statistics $I_n(\hat h)$ and $J_n(\hat h)$ with $\hat h = \hat h_n(X_1, \dots, X_n)$ a general data-dependent smoothing parameter. In a univariate context these results extend those obtained by Fan (1994), Gürtler (2000) and Tenreiro (2007). The limiting null distribution and the consistency of the considered Bickel-Rosenblatt tests for location-scale families are stated in Sect. 3. In Sect. 4 we conduct a simulation study to compare the finite sample power performance of the Bickel-Rosenblatt tests based on the null hypothesis-based bandwidth selector $\hat h_{H_0}$ with other scale equivariant bandwidth selectors $\hat h$ satisfying condition (11). We consider the cases of the normal, logistic and Gumbel null location-scale models. Although $\hat h_{H_0}$ does not satisfy this relative consistency condition unless $f \in \mathcal{F}_0$, we conclude that the tests based on it, especially those based on $I_n$, are in general more powerful than, or at least as powerful as, those based on the considered bandwidth selectors that satisfy such a condition. Some other data-driven bandwidths inspired by the methods considered in Cao and Van Keilegom (2006), Martínez-Camblor et al. (2008) and Martínez-Camblor and Uña-Álvarez (2013) in the context of smooth tests for the k-sample problem are adapted to our context and included in the simulation study. These last bandwidth selectors, which can be computed by resampling, take the general form $\hat\lambda \hat h$, where $\hat h$ is a scale equivariant bandwidth selector (e.g. $\hat h = \hat h_{H_0}$) and $\hat\lambda$ is a data-driven tuning parameter selector taking values in a finite set $\Lambda$ of tuning parameters (e.g. $\Lambda = \{0.5, 0.75, 1, 1.5, 2\}$). Nevertheless, none of these bandwidth selectors has been shown to be preferable to $\hat h_{H_0}$. Section 5 includes a brief summary and some conclusions. For convenience of exposition, the proofs are deferred to "Appendix A" and some of the simulation results are relegated to the online supplementary material.

In what follows, $\hat h = \hat h_n(X_1, \dots, X_n)$ is a general data-dependent smoothing parameter. In a univariate framework the results presented here extend those obtained by Fan (1994), Gürtler (2000) and Tenreiro (2007).
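The null hypothesis-based bandwidth of the Introduction is easy to compute. The Python sketch below is a hedged illustration: `h_null` implements the general form $\hat\theta_2 \, c_K R(f_0'')^{-1/5} n^{-1/5}$ with $c_K = (R(K)/\mu_2(K)^2)^{1/5}$, and `h_null_normal` is the normality special case with $K = \phi$. The function names are hypothetical, and the constants $R(\phi) = 1/(2\sqrt{\pi})$ and $R(\phi'') = 3/(8\sqrt{\pi})$ are standard Gaussian facts supplied here rather than quoted from the paper.

```python
import math


def sample_sd(sample):
    """S_n with divisor n, as in the text."""
    n = len(sample)
    mean = sum(sample) / n
    return math.sqrt(sum((x - mean) ** 2 for x in sample) / n)


def h_null(scale_estimate, n, r_kernel, mu2_kernel, r_f0_second_deriv):
    """General null hypothesis-based bandwidth:
    theta2_hat * c_K * R(f0'')^(-1/5) * n^(-1/5)."""
    c_k = (r_kernel / mu2_kernel ** 2) ** 0.2
    return scale_estimate * c_k * r_f0_second_deriv ** (-0.2) * n ** (-0.2)


def h_null_normal(sample):
    """Normality case with Gaussian kernel: (4/3)^(1/5) * S_n * n^(-1/5)."""
    n = len(sample)
    return (4.0 / 3.0) ** 0.2 * sample_sd(sample) * n ** (-0.2)
```

Because `sample_sd` is scale equivariant, so is `h_null_normal`: rescaling the data by $b > 0$ rescales the bandwidth by $b$, which is the property needed for location-scale invariance of the test statistics.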

Asymptotic behaviour of I n (ĥ)
In order to describe the asymptotic behaviour of the integrated square error $I_n(\hat h)$ we consider the following assumptions on the underlying probability density function $f$, the parametric family $\mathcal{F}_0$ given by (2), the estimators $\hat\theta_1$ and $\hat\theta_2$, the kernel $K$ and the data-dependent bandwidth $\hat h$. We denote by $\mathcal{F}$ an appropriate set of probability density functions on $\mathbb{R}$ that contains $\mathcal{F}_0$ and to which the underlying probability density function $f$ belongs, and by $L_r$, for $r \in [1, \infty]$, the normed vector space of measurable functions $\varphi : \mathbb{R} \to \mathbb{R}$ for which $\|\varphi\|_r < \infty$, where $\|\varphi\|_r := \left( \int |\varphi(x)|^r \, dx \right)^{1/r}$ for $r \in [1, \infty[$, and $\|\varphi\|_\infty = \inf\{ c \ge 0 : |\varphi(x)| \le c \text{ for almost every } x \}$.

Assumption (K)
The kernel $K$ belongs to $\mathcal{K}_\omega$, for some $\omega \in \{2, 3, \dots\}$, where $\mathcal{K}_\omega$ is the set of real-valued functions $K$ on $\mathbb{R}$ with continuous derivatives up to order $\omega$ such that $\lim_{|u| \to \infty} u K(u) = 0$, for which there exists $\eta \in \, ]0, 1[$ such that the real-valued functions $K_{\ell,\eta}$ defined, for $u \in \mathbb{R}$, by $K_{\ell,\eta}(u) = |u| \sup_{|h-1| \le \eta} |K^{(\ell)}(u/h)|$, are bounded and integrable on $\mathbb{R}$ for $\ell = 0, 1, \dots, \omega$. The standard Gaussian kernel $K = \phi$ belongs to $\mathcal{K}_\omega$ for all $\omega$, and every kernel with compact support with continuous derivatives up to order $\omega$ belongs to $\mathcal{K}_\omega$.

Assumption (B)
For all $f \in \mathcal{F}$, there exists a deterministic sequence $(h_n(f)) = (h(f))$ of strictly positive real numbers satisfying $h(f) \to 0$ and $n h(f) \to \infty$, as $n \to \infty$, such that

$$\xi_n := \hat h / h(f) - 1 \xrightarrow{\ p\ } 0.$$

As mentioned before, under some conditions on $f$ and $K$, assumption (B) is fulfilled by the least squares cross-validation bandwidth and by the two-stage direct plug-in bandwidth selector with $h(f) = h_0$, where $h_0$ is given by (12). Of course, in these cases assumption (B) is also fulfilled with $h(f) = h_1$, where $h_1$ is given by (7), as $h_0$ and $h_1$ are asymptotically equivalent (see Hall and Marron 1991, p. 159). From a density estimation point of view, the distinction between bandwidth selectors is usually based on the rate of convergence to zero of the relative error $\xi_n$. For example, we have $\xi_n = O_p(n^{-1/10})$ for the least squares cross-validation bandwidth (see Scott and Terrell 1987; Hall and Marron 1987), and $\xi_n = O_p(n^{-5/14})$ for the two-stage direct plug-in bandwidth selector (see Tenreiro 2003). A better order of convergence, $\xi_n = O_p(n^{-1/2})$, is achieved by the smoothed cross-validation method of Hall et al. (1992) and by a plug-in method. Note that these rates of convergence are not directly comparable since the conditions imposed on $f$ in each case are not necessarily the same.

A different situation occurs when $\hat h$ is the well-known normal scale bandwidth selector defined by

$$\hat h = (8\sqrt{\pi}/3)^{1/5} \, c_K \, n^{-1/5} \, \hat\sigma,$$

where $c_K$ is given by (8) and $\hat\sigma$ is a consistent estimator of the standard deviation $\sigma_f$ of $f$ (see Wand and Jones 1995, p. 60). Although this bandwidth selector satisfies assumption (B) with $h(f) = (8\sqrt{\pi}/3)^{1/5} c_K n^{-1/5} \sigma_f$, and we have $\xi_n = O_p(n^{-1/2})$ whenever the scale estimator is such that $\hat\sigma - \sigma_f = O_p(n^{-1/2})$, the normal scale bandwidth selector does not fulfil relative consistency condition (11).
In the next result, whose proof is given in Sect. A.1, we describe the asymptotic behaviour of the Bickel-Rosenblatt test statistic $I_n(\hat h)$ given by (3), where $\hat h = \hat h_n(X_1, \dots, X_n)$ is a general data-dependent smoothing parameter. Recall that $R(\varphi) = \int \varphi(x)^2 \, dx$ for an arbitrary real-valued measurable function $\varphi$.
(a) If the null hypothesis is true, then where (b) If the alternative hypothesis is true, then

Asymptotic behaviour of J n (ĥ)
In order to describe the asymptotic behaviour of the integrated square error J n (ĥ) some additional assumptions are needed.
where $h_0$ is given by (12).

In the next result, whose proof is given in Sect. A.2, we describe the asymptotic behaviour of the Bickel-Rosenblatt test statistic $J_n(\hat h)$ given by (4), where $\hat h = \hat h_n(X_1, \dots, X_n)$ is a general data-dependent smoothing parameter.

Theorem 2 Under assumptions
(a) If the null hypothesis is true, then and

Remark 2 Under the conditions of Remark 1, condition (14) holds if $\alpha > \max(1/10, 9/(10\omega))$. Therefore, it is not fulfilled by the least squares cross-validation bandwidth selector, and it holds for the two-stage direct plug-in bandwidth selector whenever $\omega \ge 3$.

Bickel-Rosenblatt tests for location-scale families
Under the assumptions of Theorems 1 and 2, ifθ 1 andθ 2 are location-scale and scale equivariant estimators of θ 1 and θ 2 , respectively, and the deterministic sequence , a property that is satisfied by exact optimal bandwidth (12), we can easily conclude that . Therefore, from Theorems 1 and 2 we deduce that the tests based on the critical regions is the quantile of order 1 − α of the standard normal distribution, are asymptotically of level α and consistent to test f ∈ F 0 against f ∈ F \F 0 , that is, P f C n (T n , α) → α, for all f ∈ F 0 , and P f C n (T n , α) → 1, for all f ∈ F \F 0 , where T n = T n (X 1 , . . . , X n ) stands for either I n (X 1 , . . . , X n ;ĥ(X 1 , . . . , X n )) or J n (X 1 , . . . , X n ;ĥ(X 1 , . . . , X n )). Such as in the case whereĥ is deterministic (see Fan 1995, p. 372), some simulation results reveal that the asymptotic normal distribution provides a poor approximation to the finite sample distributions of I n (ĥ) and J n (ĥ) under the null hypothesis, which implies large differences between the true level and the nominal level of the tests based on the previous critical regions. This fact is illustrated in Table 1 where type I error estimates based on 20,000 simulations under the null hypothesis are shown for the normality tests based on the previous critical regions with K = φ andĥ =ĥ H 0 given by (10).
Table 1 Type I error estimates for the normality tests based on the critical regions $C_n(I_n(\hat h), \alpha)$ and $C_n(J_n(\hat h), \alpha)$, with $K = \phi$ and $\hat h = \hat h_{H_0}$, for nominal significance levels $\alpha = 0.1, 0.05, 0.01$ and sample sizes $n = 10^k$, $k = 2, 3, 4$
Columns: Level $\alpha$; $I_n(\hat h_{H_0})$ for $n = 10^2, 10^3, 10^4$; $J_n(\hat h_{H_0})$ for $n = 10^2, 10^3, 10^4$

In order to circumvent this problem, the standard strategy (see Fan 1995, pp. 372-373) is to consider instead the test defined by the critical region

$$C(T_n, \alpha) = \{ T_n > q(T_n^*, \alpha) \},$$

where $T_n = T_n(X_1, \dots, X_n)$ stands for either $I_n(X_1, \dots, X_n; \hat h(X_1, \dots, X_n))$ or $J_n(X_1, \dots, X_n; \hat h(X_1, \dots, X_n))$, and $q(T_n^*, \alpha) = q(T_n^*, \alpha; X_1, \dots, X_n)$ denotes the quantile of order $1 - \alpha$ of the distribution of the random variable $T_n^*$ defined as follows:
(1) Use the original sample $X_1, \dots, X_n$ to compute $\hat\theta_1$ and $\hat\theta_2$;
(2) Draw a random sample $U_1, \dots, U_n$ from the distribution $f_0$ and define the bootstrap sample by $X^*_{n,i} = \hat\theta_1 + \hat\theta_2 U_i$, for $i = 1, \dots, n$;
(3) Use the bootstrap sample to compute $T_n(X^*_{n,1}, \dots, X^*_{n,n})$ and call it $T_n^*$.

Of course, if the test statistic $T_n$ is location-scale invariant, which occurs if we further assume that $\hat h$ is scale equivariant, the quantile $q(T_n^*, \alpha)$, which does not depend on the observed sample, is the quantile of order $1 - \alpha$ of $T_n$ under $H_0$, which we denote by $q(T_n, \alpha)$. This quantile is assumed to be a known quantity as it can be well approximated by repeating steps (2) and (3) a large number of times. As stated in the next result, whose proof is presented in Sect. A.3, in this important case the test based on the critical region $C(T_n, \alpha)$ has a level of significance at most equal to $\alpha$ for each sample size $n$ and is consistent to test $f \in \mathcal{F}_0$ against $f \in \mathcal{F} \setminus \mathcal{F}_0$.
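Steps (1) to (3) of the resampling scheme can be sketched in Python as follows. This is only an illustration under stated assumptions: `studentized_range` is a hypothetical location-scale invariant stand-in for the test statistic (it is not $I_n$ or $J_n$), `draw_f0` is a user-supplied sampler from $f_0$, and all function names are illustrative rather than taken from the paper.

```python
import math
import random


def empirical_quantile(values, prob):
    """Smallest order statistic whose empirical CDF is at least prob."""
    ordered = sorted(values)
    k = max(0, min(len(ordered) - 1, math.ceil(prob * len(ordered)) - 1))
    return ordered[k]


def bootstrap_critical_value(statistic, n, draw_f0, theta1, theta2,
                             alpha, n_rep=1000, seed=0):
    """Approximate the (1 - alpha)-quantile of T_n^* by repeating
    steps (2) and (3) n_rep times."""
    rng = random.Random(seed)
    replicates = []
    for _ in range(n_rep):
        # Step (2): bootstrap sample X* = theta1_hat + theta2_hat * U, U ~ f0.
        boot = [theta1 + theta2 * draw_f0(rng) for _ in range(n)]
        # Step (3): evaluate the statistic on the bootstrap sample.
        replicates.append(statistic(boot))
    return empirical_quantile(replicates, 1.0 - alpha)


def studentized_range(sample):
    """Toy location-scale invariant statistic (hypothetical stand-in)."""
    n = len(sample)
    mean = sum(sample) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in sample) / n)
    return (max(sample) - min(sample)) / sd


def rejects(statistic, sample, critical_value):
    """Decision rule: reject H0 when T_n exceeds the critical value."""
    return statistic(sample) > critical_value
```

For an invariant statistic the simulated quantile does not depend on the values passed as `theta1` and `theta2`, so it can be tabulated once per sample size by simulating directly from $f_0$.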
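Theorem 3 Under the assumptions of Theorems 1 or 2, let us assume that $\hat\theta_1$ and $\hat\theta_2$ are location-scale and scale equivariant estimators of $\theta_1$ and $\theta_2$, respectively. If the bandwidth selector $\hat h$ is scale equivariant, then the test statistic $T_n$, where $T_n$ stands for either $I_n(\hat h)$ or $J_n(\hat h)$, is location-scale invariant, and the test based on the critical region

$$C(T_n, \alpha) = \{ T_n > q(T_n, \alpha) \},$$

where $\alpha \in \, ]0, 1[$, is such that

$$P_f\big(C(T_n, \alpha)\big) \le \alpha \ \text{ for all } f \in \mathcal{F}_0 \text{ and } n \in \mathbb{N}, \qquad P_f\big(C(T_n, \alpha)\big) \to 1 \ \text{ for all } f \in \mathcal{F} \setminus \mathcal{F}_0.$$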

Finite sample results
In this section we conduct a simulation study to compare the finite sample power performance of goodness-of-fit tests based on critical regions (15) for several choices of the scale equivariant bandwidth selectorĥ. More precisely, we intend to compare the null hypothesis-based bandwidth selectorĥ H 0 proposed by Bowman (1992) given by (9), with other scale equivariant bandwidth selectorsĥ satisfying relative consistency condition (11), for which it is expected, at least from an asymptotic point of view, that the kernel estimator fĥ is an effective estimator of the underlying density f irrespective of which of the null or the alternative hypothesis is true. To this end, besidesĥ H 0 three other automatic and scale equivariant bandwidth selectors are considered in our study. They are the least squares cross-validation bandwidth selectorĥ CV , the two-stage direct plug-in bandwidth selectorĥ PI (see Wand and Jones 1995, pp. 63-65, 71-72) and also a modified version of the bandwidth selector proposed in Chacón and Tenreiro (2013), where the cross-validation function is replaced by the weighted cross-validation function with γ = 0.5 (for the definition of the weighted crossvalidation function, see Tenreiro 2017, p. 3440). Under some conditions on f ,ĥ CT fulfils assumption (B) with h( f ) = h 0 and ξ n = O p n −5/14 (see Tenreiro 2013, Theorem 3.1, p. 2207). The power results observed in our simulation study for the bandwidthsĥ CV ,ĥ PI andĥ CT reveal that this latter bandwidth presents a good overall performance for a wide range of alternative density features, which is relevant for real data situations where there is usually little prior information on the alternative density shape. For this reason, and because no essential feature is lost, hereafter we confine to the results obtained by the bandwidthsĥ H 0 andĥ CT . 
From representations (5) and (6), and taking for K the standard normal density, which we always assume from now on, the test statistics I n (ĥ) and J n (ĥ) can be evaluated from the equalities Taking into account the convolution properties of the Gaussian densities (see Wand and Jones 1995, pp. 177-180), the calculation of I n (ĥ) and J n (ĥ) is especially simple for the normality test in which case no numerical integration is required. In this case, we have We start the study on the finite sample performance of the tests based on critical regions (15) for nominal levels α = 0.1, 0.05, 0.01, by considering the test of normality in which case the null model is given by (2) with f 0 = φ, and we takeθ 1 =X n andθ 2 = S n the maximum likelihood estimators of θ 1 and θ 2 under H 0 . As the test statistics I n (ĥ) and J n (ĥ) forĥ =ĥ H 0 andĥ =ĥ CT are invariant under null hypothesis (see Theorem 3), the quantiles of order 1 − α in critical regions (15) are estimated by performing 100,000 simulations under the null hypothesis. We consider alternative distributions from a well-known set of normal mixture densities considered in Marron and Wand (1992) which is often used in the context of kernel density estimation. This set is very rich, containing densities with a wide variety of features, such as kurtosis, skewness and multimodality. The densities of the considered alternatives jointly with the density of the normal distribution with the same mean and variance are shown in Fig. 1. The densities are identified as in Marron and Wand (1992), and the values for the parameters of this set of normal mixture densities are given in Table 1 of the same article. For the nominal level α = 0.05 and sample sizes n = 20, 50, 80 we report in Table 2 the power estimates based on 10,000 samples from the considered set of alternative densities. All the simulations in this work were carried out using programs written in the R language (R Development Core Team 2019).
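For the normality test with $K = \phi$, the closed forms alluded to above can be obtained from the identity $\int \phi_a(y - u)\, \phi_b(y - v)\, dy = \phi_{\sqrt{a^2 + b^2}}(u - v)$, where $\phi_s(u) = \phi(u/s)/s$. The Python sketch below is a reconstruction under that identity, not a transcription of the paper's displayed equalities. Writing $h = \hat h / S_n$ and $Y_j$ for the scaled residuals, it uses $J_n = nh\{ n^{-2} \sum_{i,j} \phi_{\sqrt{2} h}(Y_i - Y_j) - 2 n^{-1} \sum_i \phi_{\sqrt{1 + h^2}}(Y_i) + 1/(2\sqrt{\pi}) \}$ and the analogous expression for $I_n$ with $\phi_{\sqrt{1 + 2h^2}}$ and $1/(2\sqrt{\pi(1 + h^2)})$. The function names are hypothetical.

```python
import math


def phi(u, s=1.0):
    """Density of N(0, s^2) evaluated at u."""
    return math.exp(-0.5 * (u / s) ** 2) / (s * math.sqrt(2.0 * math.pi))


def bickel_rosenblatt_normal(sample, h_hat):
    """Return (I_n, J_n) for the normality test with Gaussian kernel,
    theta1_hat = sample mean and theta2_hat = S_n (divisor n)."""
    n = len(sample)
    mean = sum(sample) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in sample) / n)
    y = [(x - mean) / sd for x in sample]   # scaled residuals
    h = h_hat / sd                          # bandwidth on the residual scale
    # Integral of the squared kernel estimator of the residuals.
    cross = sum(phi(yi - yj, math.sqrt(2.0) * h) for yi in y for yj in y) / n ** 2
    # I_n compares the estimator with the smoothed null density phi_h * phi.
    i_n = n * h * (cross
                   - 2.0 * sum(phi(yi, math.sqrt(1.0 + 2.0 * h * h)) for yi in y) / n
                   + 1.0 / (2.0 * math.sqrt(math.pi * (1.0 + h * h))))
    # J_n compares the estimator with the null density phi itself.
    j_n = n * h * (cross
                   - 2.0 * sum(phi(yi, math.sqrt(1.0 + h * h)) for yi in y) / n
                   + 1.0 / (2.0 * math.sqrt(math.pi)))
    return i_n, j_n
```

The double sum makes each evaluation $O(n^2)$, which is inexpensive for the sample sizes $n = 20, 50, 80$ used in the study, and no numerical integration is involved.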
Taking into account some simulation experiments, not presented here to save space, to estimate the mean integrated squared error of the kernel density estimator for each one of the bandwidthsĥ H 0 andĥ CT , we can conclude that the kernel estimator based on h H 0 performs better than that based onĥ CT for the normal mixture densities 2, 6, 8, 9 and 12 (for the considered sample sizes). This may explain the results shown in Table  2 where the tests based onĥ H 0 perform generally better than those based onĥ CT for alternatives 2, 6, 8 and 9, and they perform similarly for alternative 12. The opposite situation occurs for the remaining four normal mixtures where the kernel density estimator based onĥ CT performs much better than that based onĥ H 0 . However, only for the normal mixtures 4 and 15 the tests based onĥ CT perform clearly better than those based onĥ H 0 . For densities 3 and 7 the tests perform similarly. As the considered alternative densities are far from the null hypothesis density family in shape, we can conclude that even a low performing bandwidth selector from a density estimation point of view is good enough to detect such alternatives. In this situation, estimation and testing demand different answers regarding bandwidth selection. The results presented in Table 2 for the skewed unimodal density 2 also deserve an additional comment. This is an interesting case because density 2 is not far from the normal density in shape, and we may expect thatĥ H 0 , as based on the null density family, may reach good power results for alternative densities which are not far from the null density model in shape. The simulations results observed for density 2 support this idea. The results presented in Table 2 also show different performances for the tests based on the test statistics I n (ĥ) and J n (ĥ) no matter which bandwidth is used. 
The statistic $J_n(\hat h)$ seems to be more effective in detecting multimodal alternatives, whereas $I_n(\hat h)$ shows in general a better performance in the detection of unimodal alternatives.

Based on the previous conclusions, we have good reasons to believe that $\hat h_{H_0}$ may reach a good power performance for wide sets of alternative distributions. In order to examine this question in detail, other than the goodness-of-fit test for the normal distribution we also consider two other null location-scale models. They are the logistic model where $f_0(x) = (\exp(-x/2) + \exp(x/2))^{-2}$, for $x \in \mathbb{R}$, and the Gumbel extreme value model where $f_0(x) = \exp(-x - \exp(-x))$, for $x \in \mathbb{R}$. For this latter family of distributions we take for $\hat\theta_1$ and $\hat\theta_2$ the maximum likelihood estimators of $\theta_1$ and $\theta_2$, which satisfy

$$\hat\theta_1 = -\hat\theta_2 \log\Big( n^{-1} \sum_{i=1}^{n} \exp(-X_i/\hat\theta_2) \Big), \qquad \hat\theta_2 = \bar X_n - \frac{\sum_{i=1}^{n} X_i \exp(-X_i/\hat\theta_2)}{\sum_{i=1}^{n} \exp(-X_i/\hat\theta_2)}.$$

In the case of the goodness-of-fit test for the logistic distribution we use the moment estimators $\hat\theta_1 = \bar X_n$ and $\hat\theta_2 = \sqrt{3}\, S_n/\pi$, which are simpler to evaluate and nearly as efficient as the maximum likelihood estimators (see Johnson et al. 1995, pp. 127-130). Similarly to the goodness-of-fit test for the normal distribution, we are under the assumptions of Theorem 3 and the tests based on critical regions (15) are implemented as explained before.

For comparison purposes, besides the bandwidth selectors $\hat h_{H_0}$ and $\hat h_{CT}$, we consider in this study other bandwidth selectors which are based on the common principle that the bandwidth should be tuned in order to improve the power performance of the test. In order to implement this idea, we consider the set of scale equivariant bandwidths based on $\hat h$, where $\hat h$ stands for $\hat h_{H_0}$ or $\hat h_{CT}$, given by

$$\hat h_\lambda = \lambda \hat h, \quad \lambda \in \Lambda,$$

where $\Lambda$ is a finite set of strictly positive real numbers that will act as tuning parameters. Besides the value $\lambda = 1$ associated with the reference bandwidth $\hat h$, this set is meant to include tuning parameters smaller and larger than one. If we denote by $T_{n,\lambda}(X_1, \dots, X_n)$ one of the statistics $I_n(X_1, \dots, X_n; \hat h_\lambda(X_1, \dots, X_n))$ or $J_n(X_1, \dots, X_n; \hat h_\lambda(X_1, \dots, X_n))$, from the scale equivariance of $\hat h$ we know that $T_{n,\lambda}$ is location-scale invariant, and therefore the null distribution of $T_{n,\lambda}$ does not depend on $f \in \mathcal{F}_0$, where $\mathcal{F}_0$ is given by (2). Therefore, the tests with critical regions

$$C(T_{n,\lambda}, \alpha) = \{ T_{n,\lambda} > q(T_{n,\lambda}, \alpha) \}, \qquad (16)$$

where $q(T_{n,\lambda}, \alpha)$ denotes the quantile of order $1 - \alpha$ of $T_{n,\lambda}$ under $H_0$, have levels of significance at most equal to $\alpha$. As before, we assume that these quantiles are known quantities as they can be well approximated by simulating under the null hypothesis a large number of times (100,000 replications under the null hypothesis are used).

The power properties of each one of the previous test procedures depend on $\lambda$, which is the reason why its choice is usually crucial to obtain a well-performing test procedure. In order to make such a choice, we need to define a suitable location-scale invariant measurable function taking values in $\Lambda$, $\hat\lambda = \hat\lambda(X_1, \dots, X_n)$, called the tuning parameter selector, on the basis of which we can consider a test procedure based on the critical region

$$C(T_{n,\hat\lambda}, \alpha) = \{ T_{n,\hat\lambda} > q(T_{n,\hat\lambda}, \alpha) \}, \qquad (17)$$

where $q(T_{n,\hat\lambda}, \alpha)$ denotes the quantile of order $1 - \alpha$ of $T_{n,\hat\lambda}$ under $H_0$. This test has a level of significance at most equal to $\alpha$ for each sample size $n$.
The second method for selecting λ we consider is based on the observation that given the values T n,λ (X 1 , . . . , X n ) of the test statistics for the observed sample X 1 , . . . , X n , more evidence against the null hypothesis is obtained for smaller p-values. Therefore, to construct a powerful test it makes sense to minimise the bootstrap p-value along λ ∈ , an idea that was used in Martínez-Camblor et al. (2008, pp. 4014-4015); see also Martínez-Camblor and Uña-Álvarez (2009). Hence, we denote byλ 2 = λ 2 (X 1 , . . . , X n ) the tuning parameter selector given bŷ where X 0,l , for l = 1, . . . , n B 0 , are independent copies of the random variable X 0 with density f 0 .
The last method for selecting λ we consider was introduced in Martínez-Camblor and Uña-Álvarez (2013, p. 273) and is based on the idea that λ should be chosen in order to maximise the discrimination capability, between the null and the alternative hypotheses, of the diagnostic variable T n,λ expressed by the area under the ROC curve associated with it. As this area is given by P(T 0 n,λ < T 1 n,λ ), where T 0 n,λ and T 1 n,λ are independent random variable with the null and the alternative distributions of T n,λ , respectively (see Krzanowski and Hand 2009, pp. 26-28), we consider the tuning parameter selectorλ 3 =λ 3 (X 1 , . . . , X n ) defined bŷ where X 0,l , for l = 1, . . . , n B 0 , and X * k, j , for k = 1, . . . , B 1 and j = 1, . . . , n, are defined as before.
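The quantity $P(T^0_{n,\lambda} < T^1_{n,\lambda})$ can be estimated by the proportion of (null, alternative) pairs of simulated statistics in the right order. The Python sketch below illustrates that idea only: it assumes the simulated values are already available as dictionaries mapping each $\lambda$ to a list of statistics, and the names `empirical_auc` and `select_lambda_by_auc` are hypothetical. It is not the paper's exact definition of the selector, whose display is not reproduced in this text.

```python
def empirical_auc(null_values, alt_values):
    """Proportion of pairs (t0, t1) with t0 < t1, an estimate of
    P(T0 < T1), i.e. of the area under the ROC curve."""
    wins = sum(1 for t0 in null_values for t1 in alt_values if t0 < t1)
    return wins / (len(null_values) * len(alt_values))


def select_lambda_by_auc(null_stats, alt_stats):
    """Pick the tuning parameter with the largest estimated AUC.

    null_stats and alt_stats map each lambda to simulated values of
    T_{n,lambda} under the null and under a resampled alternative.
    Ties are broken in favour of the smallest lambda."""
    return max(sorted(null_stats),
               key=lambda lam: empirical_auc(null_stats[lam], alt_stats[lam]))
```

A value of the estimated area close to 1 means that, for that tuning parameter, the statistic separates the resampled alternative from the null almost perfectly; a value near 0.5 means it does not discriminate at all.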
Taking into account that, conditionally on the sequences $N = (N_l)$, $U = (U_l)$ and $X_0 = (X_{0,l})$, the previous tuning parameter selectors are location-scale invariant, we conclude that the tests based on critical region (17), where $\hat\lambda$ stands for either $\hat\lambda_1$, $\hat\lambda_2$ or $\hat\lambda_3$, have levels of significance at most equal to $\alpha$ for each sample size $n$ (conditionally on $N$, $U$ and $X_0$). In the practical implementation of these tests we always take $\Lambda = \{0.5, 0.75, 1, 1.5, 2\}$. For the normality goodness-of-fit tests we take $B_0 = B_1 = 200$, and the quantiles $q(T_{n,\hat\lambda}, \alpha)$ are estimated by performing 100,000 simulations under the null hypothesis. For the goodness-of-fit tests for the logistic and Gumbel distributions we take $B_0 = B_1 = 100$, and the quantiles are estimated by performing 50,000 simulations under the null hypothesis, because the evaluation of the corresponding test statistics is more time-consuming than in the normal case.
For alternatives 4 and 8 from the Marron and Wand (1992) set of normal mixture densities, Tables 3 and 4 present power estimates, at level $\alpha = 0.05$, for the normality goodness-of-fit tests based on critical regions (16) with $\hat h_\lambda = \lambda \hat h$, $\lambda = 0.25, 0.5, 0.75, 1, 1.25, 1.5, 1.75, 2, 3, 5$, and (17) with $\Lambda = \{0.5, 0.75, 1, 1.5, 2\}$, where $\hat h = \hat h_{H_0}, \hat h_{CT}$. As mentioned before, for all sample sizes we see that the empirical power depends on $\lambda$. However, these two alternatives reveal different situations. For alternative 8 the best power results are in general observed for values of $\lambda$ close to or even equal to 1, and therefore the tests based on $\hat\lambda_j \hat h$, for $j = 1, 2, 3$, are not expected to be more powerful than those based on the bandwidth selector $\hat h$. The figures in both tables support this idea. A similar situation occurs for alternative 4 and bandwidth $\hat h_{CT}$. However, when the bandwidth $\hat h_{H_0}$ is used for alternative 4, an alternative for which the kernel estimator based on $\hat h_{H_0}$ performs poorly from a density estimation point of view, we see that it is highly advisable to use a tuning parameter smaller than 1, which may explain the good results obtained by the tuning parameter selectors $\hat\lambda_2$ and $\hat\lambda_3$ for the test based on $I_n$ and by $\hat\lambda_1$ for the test based on $J_n$.
For $\alpha = 0.01, 0.05, 0.1$, and sample sizes $n = 20, 50, 80$, we present in Tables 5-7 (see the supplementary online material) estimates of the levels of significance of the goodness-of-fit tests for the normal, logistic and Gumbel distributions, respectively, based on $I_n(\hat h)$ and $J_n(\hat h)$ for the different bandwidth selectors $\hat h$ based on $\hat h_{H_0}$ and $\hat h_{CT}$. They are based on 20,000 simulations under the null hypothesis.
These results indicate that all the tests have an effective level of significance very close to α. With some few exceptions, the estimated levels are inside the approximate 95% confidence interval for the preassigned nominal levels.
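To make the last statement concrete, an approximate 95% interval around a nominal level can be computed from the normal approximation to the binomial distribution. The paper does not state how its interval is obtained, so the formula $\alpha \pm 1.96 \sqrt{\alpha(1 - \alpha)/N}$ used in this Python sketch is an assumption, as are the function names. With $N = 20{,}000$ simulations and $\alpha = 0.05$ it gives roughly 0.047 to 0.053.

```python
import math


def level_interval(alpha, n_sim, z=1.96):
    """Approximate 95% interval for an estimated level when the true
    level equals alpha (normal approximation to the binomial)."""
    half_width = z * math.sqrt(alpha * (1.0 - alpha) / n_sim)
    return alpha - half_width, alpha + half_width


def within_interval(estimated_level, alpha, n_sim, z=1.96):
    """True when an estimated level is compatible with the nominal alpha."""
    lower, upper = level_interval(alpha, n_sim, z)
    return lower <= estimated_level <= upper
```

An estimated level falling outside this interval would suggest that the simulated critical value is not reproducing the nominal level, beyond what Monte Carlo error alone explains.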
Although a larger set of alternative distributions, usually considered in power studies for testing the normal, logistic and Gumbel distributions, was considered in our study (see Epps and Pulley 1983; Meintanis 2004; Epps 2005; Romão et al. 2010), we limit ourselves to presenting in Tables 8-10 (normal distribution), Tables 11-13 (logistic distribution) and Tables 14-16 (Gumbel distribution) the empirical power results for some of these alternatives (see the supplementary online material). The first seven alternatives are from the following location-scale families: uniform, exponential, Laplace, Cauchy, normal, logistic and Gumbel. The remaining six alternatives are from the following families of distributions: Student, lognormal, Tukey, gamma, Weibull and beta. For the exact definition of the distributions included in these tables, see Epps (2005). We limit ourselves to presenting here the results obtained for the nominal level $\alpha = 0.05$ and sample sizes $n$ = 20, 50, 80. However, similar conclusions can be drawn for the nominal levels $\alpha$ = 0.1, 0.01 also considered in our study. For comparison purposes, we include in the previous tables power estimates for the classical Anderson-Darling (1954) goodness-of-fit test, which is based on a weighted quadratic distance between the empirical distribution function and a parametric estimator of the distribution function of $f$ under the null hypothesis (see Stephens 1986, and the references therein, for goodness-of-fit tests based on the empirical distribution function). In order to implement this test, the quantiles of order $1-\alpha$ of the Anderson-Darling test statistic $A^2$ are estimated by performing 100,000 simulations under the null hypothesis. In the case of the goodness-of-fit test for the normal distribution we also include in our simulation study the highly recommended Shapiro-Wilk (1965) test SW implemented by the R function shapiro.test. For all the tests included in the study, the power estimates are based on 10,000 samples from the considered alternatives.
Table 3 Power estimates, at level $\alpha = 0.05$, for the normality goodness-of-fit tests based on $I_n(\lambda \hat h_{H_0})$ and $J_n(\lambda \hat h_{H_0})$, with $\lambda$ = 0.25, 0.5, 0.75, 1, 1.25, 1.5, 1.75, 2, 3, 5, and $I_n(\hat\lambda_j \hat h_{H_0})$ and $J_n(\hat\lambda_j \hat h_{H_0})$, $j = 1, 2, 3$, with the set of candidate values $\{0.5, 0.75, 1, 1.5, 2\}$, for alternatives 4 and 8 from Marron and Wand (1992). The number of replications used is 10,000.
Table 4 Power estimates, at level $\alpha = 0.05$, for the normality goodness-of-fit tests based on $I_n(\lambda \hat h_{CT})$ and $J_n(\lambda \hat h_{CT})$, with $\lambda$ = 0.25, 0.5, 0.75, 1, 1.25, 1.5, 1.75, 2, 3, 5, and $I_n(\hat\lambda_j \hat h_{CT})$ and $J_n(\hat\lambda_j \hat h_{CT})$, $j = 1, 2, 3$, with the set of candidate values $\{0.5, 0.75, 1, 1.5, 2\}$, for alternatives 4 and 8 from Marron and Wand (1992). The number of replications used is 10,000.
Although none of the considered tests presents uniformly better results for the considered set of alternative distributions, the main conclusion that can be drawn from this study is that the tests based on $\hat h_{H_0}$ present a good overall performance for a wide set of alternative distributions. Regarding the two tests based on $\hat h_{H_0}$, our preference goes to the test based on the test statistic $I_n(\hat h_{H_0})$. This test is in general more powerful than, or at least as powerful as, the tests based on $\hat h_{CT}$, and also proves to be quite competitive against the Anderson-Darling test, although it is slightly less powerful than the Shapiro-Wilk test for normality. However, no matter the considered null hypothesis model, for some of the considered alternatives, such as the light-tailed alternatives uniform and beta, the test based on $J_n(\hat h_{H_0})$ proves to be more powerful than that based on $I_n(\hat h_{H_0})$.
Finally, note that the new bandwidth selectors $\hat\lambda_j \hat h_{H_0}$ or $\hat\lambda_j \hat h_{CT}$, for $j = 1, 2, 3$, which are much more time-consuming to compute than $\hat h_{H_0}$ or $\hat h_{CT}$, do not in general reveal any special advantage over these simple-to-compute bandwidths, the exception being the Tukey(5) alternative distribution for the normal and the logistic models. As some simulation experiments reveal (not presented here), the extra source of variation they introduce into the null hypothesis distribution of the associated test statistics, especially those based on $J_n$, may explain the observed results.

Conclusions
The choice of the bandwidth is crucial to the performance of the Parzen-Rosenblatt estimator, and several automatic bandwidth selectors considered in the literature satisfy the relative consistency condition (11). This is not the case for the null hypothesis-based bandwidth selector $\hat h_{H_0}$, which satisfies this condition only when the null hypothesis is true. However, if we want to use the Bickel-Rosenblatt test statistics to test the hypothesis that the underlying density function $f$ is a member of a location-scale family of probability density functions, the finite sample results presented in this paper support the idea that the tests based on $\hat h_{H_0}$ present a good overall performance for a wide set of alternative distributions. These tests are in general more powerful than, or at least as powerful as, those based on data-dependent smoothing parameters $\hat h$ that satisfy the relative consistency condition irrespective of which of the null or the alternative hypothesis is true, as well as those inspired by existing data-driven bandwidths for smooth tests for the k-sample problem which can be computed by resampling.

A.1 Proof of Theorem 1
Consider the expansion
In order to establish the asymptotic behaviour of each one of the previous terms, we use the approach of Tenreiro (2001), which is based on the Taylor expansion
Note that, from assumption (K), the functions $K_{\partial(\ell)}$ are bounded and integrable on $\mathbb{R}$, for $\ell = 1, \dots, \omega - 1$, and there exists $\eta \in \,]0, 1[$ such that the function $K_{\partial(\omega),\eta}(u) := \sup_{|h-1| \le \eta} |K_{\partial(\omega)}(u, h)|$ is bounded and integrable on $\mathbb{R}$. From the previous Taylor expansion we deduce the following expansions for $f_{\hat h}$, $K_{\hat h} * f$ and $K_{\hat h} * g(\cdot; \hat\theta_1, \hat\theta_2)$, which play a crucial role in what follows. For $x \in \mathbb{R}$ and denoting by $h$ the deterministic bandwidth $h(f)$ given in assumption (B), we have
and
Each one of the terms in (21) is studied in the following propositions. We denote by $h$ the deterministic sequence $h(f)$ given in assumption (B).

Proposition 1 We have
where $U_n$, given by (25), is asymptotically normal with zero mean and variance $2R(K * K)R(f)$.
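As an illustrative evaluation of this variance constant (not taken from the paper), assume the standard Gaussian kernel $K = \phi$, the standard normal density $f = \phi$, and the usual notation $R(g) = \int g(u)^2\,du$. Under these assumptions the constant can be computed in closed form:

```latex
\begin{align*}
(K * K)(u) &= \frac{1}{2\sqrt{\pi}}\, e^{-u^2/4}
  && \text{(density of } N(0,2)\text{)}, \\
R(K * K) &= \int (K * K)(u)^2 \, du = \frac{1}{2\sqrt{2\pi}} \approx 0.1995, \\
R(f) &= \int \phi(u)^2 \, du = \frac{1}{2\sqrt{\pi}} \approx 0.2821, \\
2\, R(K * K)\, R(f) &= \frac{1}{2\sqrt{2}\,\pi} \approx 0.1125 .
\end{align*}
```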

Proposition 2 We have
Moreover, under the null hypothesis we have $I_{n,2} = O_p(n^{-1})$.
To establish the order of convergence of $I_{n,3}$ we need the following lemma. Note that we are always assuming that $\hat h$ satisfies assumption (B).
(c) If $\hat\gamma_n = \hat\gamma_n(\cdot; X_1, \dots, X_n) : \mathbb{R} \to \mathbb{R}$ is such that $\|\hat\gamma_n\|_r = O_p(1)$, for some $r \in [1, \infty]$, then
Proof Write $S_{n,a}$, $S_{n,b}$ and $S_{n,c}$ for the sums considered in parts (a), (b) and (c), respectively. The order of convergence stated in part (a) follows from the inequalities
In order to establish parts (b) and (c), it is enough to note that for all $\epsilon \in \,]0, \eta[$ and for $|\hat h/h - 1| \le \epsilon$ we have
(22), (23) and (24) we have
Each one of these terms will be studied in the following propositions. As before, we denote by $h$ the deterministic sequence $h(f)$ whose existence is ensured by assumption (B).

Proposition 4 We have
where $U_n$ is defined in Proposition 1 and $V_n$, given by (31), is asymptotically normal with zero mean and variance $\mu_2(K)^2\, \mathrm{Var}_f(f(X_1))$.
Proof Taking into account equality (22)
Using degenerate U-statistics techniques (see Hall 1984) we know that
with $U_n$ given by (25) and
is asymptotically normal with zero mean and variance equal to $\mu_2(K)^2\, \mathrm{Var}_f(f(X_1))$.
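For a concrete sense of the size of this variance, the following evaluation assumes (purely for illustration, not from the paper) the standard Gaussian kernel, for which $\mu_2(K) = \int u^2 K(u)\,du = 1$, and the standard normal density $f = \phi$:

```latex
\begin{align*}
E_f\big(f(X_1)\big) &= \int \phi(u)^2 \, du = \frac{1}{2\sqrt{\pi}}, \\
E_f\big(f(X_1)^2\big) &= \int \phi(u)^3 \, du = \frac{1}{2\pi\sqrt{3}}, \\
\mu_2(K)^2\, \mathrm{Var}_f\big(f(X_1)\big)
  &= \frac{1}{2\pi\sqrt{3}} - \frac{1}{4\pi} \approx 0.0123 .
\end{align*}
```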
Moreover, under the null hypothesis we have $J_{n,2} = O_p(n^{-1})$.
We can now conclude the proof of Theorem 2. From Proposition 4 and assumption (B') we have $J_{n,1} = O_p\big((nh)^{-1} + \xi_n^{\omega}\big)$.