## 5.11 One Sample and Two Sample Tests

One-sample tests are used to compare the data set to a fixed criterion (for example, population mean, population percentile). Examples of one-sample tests have already been implicitly presented in Section 5.3 (Tolerance Limits) and Section 5.4 (Prediction Limits), as well as Section 5.6 (Distributional Tests). Other examples are goodness-of-fit tests, where, for example, you would like to know if the data support predictions regarding the value of the population mean. The null hypothesis would be:

H₀: µ = µ₀

where µ = actual true population mean and µ₀ = hypothesized population mean (under H₀),

and the alternative hypothesis is Hᴀ: µ ≠ µ₀.

A one-sample t-test can be applied in this case if the following assumptions hold:

- the data are normally distributed
- the sample drawn from the population is random
- the cases of the samples are independent
- the hypothesized population mean (µ₀) is specified in advance
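As a minimal sketch of the one-sample t-test described above, the statistic is the difference between the sample mean and µ₀ divided by the standard error of the mean, with n - 1 degrees of freedom. The function name, the sample concentrations, and the criterion value are hypothetical. The critical value is the standard two-sided t-table entry for a significance level of 0.05 with 4 degrees of freedom. The sketch does not check any of the assumptions listed above.

```python
import math
import statistics


def one_sample_t(data, mu0):
    """Return (t statistic, degrees of freedom) for H0: mu = mu0."""
    n = len(data)
    mean = statistics.mean(data)
    sd = statistics.stdev(data)  # sample standard deviation (n - 1 denominator)
    t_stat = (mean - mu0) / (sd / math.sqrt(n))
    return t_stat, n - 1


# Hypothetical concentrations (mg/L) and hypothetical criterion mu0 = 5.0 mg/L.
concentrations = [4.0, 5.0, 6.0, 7.0, 8.0]
t_stat, df = one_sample_t(concentrations, mu0=5.0)

# Two-sided critical value for alpha = 0.05 and df = 4, read from a t-table.
t_crit = 2.776
reject_h0 = abs(t_stat) > t_crit
```

With these hypothetical values the calculated t-value does not exceed the critical value, so the data provide no evidence that the population mean differs from µ₀.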

However, many groundwater monitoring scenarios require the comparison of two populations, such as a population of compliance (potentially impacted) data to a population of spatial or temporal background (unimpacted) data. The statistical tests used for these comparisons are referred to as two-sample tests and are used to determine if the two populations are statistically different at a specified level of significance. Examples of parametric two-sample tests include Welch’s t-test and the pooled variance t-test. Nonparametric tests include the Wilcoxon rank sum test, the signed rank test, and the Tarone-Ware two sample test for censored data. These two-sample tests and their applications are described briefly below. Table F-3 includes information about checking assumptions for two sample tests.

### 5.11.1 Welch’s T-test

Welch’s t-test assumes that each population is normally distributed and requires that no temporal trends exist in the data, no spatial variability is present, and samples are statistically independent. One advantage of Welch’s t-test is that it does not require you to assume that population variances are equal. Another advantage is that while Welch’s t-test provides statistical power comparable to other two-sample tests, it is much simpler to use than other similar tests. The only calculations required are computing the mean, standard deviation, variance, t-statistic, and degrees of freedom. Many statistical software packages offer Welch’s t-test, but most do not determine if the requirements and assumptions are met.

When applying Welch's t-test, the calculated t-value is compared to a critical t-value which is based on the selected significance level of the test and on the number of degrees of freedom. If the calculated t-value is less than or equal to the critical value, then no evidence exists for a statistically significant difference between the two population means at the selected confidence level. The equations for the necessary calculations, including the critical t-values for common significance levels, can be found in most statistical texts and in the Unified Guidance.
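The calculations named above (mean, variance, t-statistic, and degrees of freedom) can be sketched as follows. This is an illustration only: it uses the Welch-Satterthwaite approximation for the degrees of freedom as given in most statistical texts, the function name and data values are hypothetical, and the rounding conventions in the Unified Guidance may differ. The critical t-value is not computed; it is looked up for the returned degrees of freedom.

```python
import math
import statistics


def welch_t(background, compliance):
    """Return (t statistic, approximate degrees of freedom) for Welch's t-test."""
    n_b, n_c = len(background), len(compliance)
    # Squared standard error of each mean: sample variance divided by sample size.
    se2_b = statistics.variance(background) / n_b
    se2_c = statistics.variance(compliance) / n_c
    diff = statistics.mean(compliance) - statistics.mean(background)
    t_stat = diff / math.sqrt(se2_b + se2_c)
    # Welch-Satterthwaite approximation; generally not a whole number.
    df = (se2_b + se2_c) ** 2 / (se2_b ** 2 / (n_b - 1) + se2_c ** 2 / (n_c - 1))
    return t_stat, df


# Hypothetical background and compliance concentrations.
background = [1.0, 2.0, 3.0, 4.0, 5.0]
compliance = [2.0, 4.0, 6.0, 8.0, 10.0]
t_stat, df = welch_t(background, compliance)
# Compare t_stat to the tabulated critical t-value for df at the chosen significance level.
```

Because the variances enter separately, the two data sets need not share a common variance, which is the property that distinguishes this test from the pooled variance t-test.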

- Study Question 2: Are concentrations greater than background concentrations?
- Study Question 5: Is there a trend in contaminant concentrations?

Data are normally distributed. This test will still provide relatively reliable results if data are not heavily skewed (coefficient of variation is less than or equal to 1.5).

- No naturally-occurring spatial variability can be present.
- Samples must be spatially and temporally independent.
- No temporal trends in the data can be present.
- Use of 8 to 10 measurements is recommended; a larger data set may be required if the data are skewed or contain nondetects.

- This test does not require equal population variances.
- Use of this test on log-transformed lognormal data is not recommended.
- This test is simpler to use than other two-sample tests with comparable statistical power.

Additional information on Welch’s t-test including examples of how to perform the test can be found in Chapter 16.1.2 and Chapter 16.1.3, Unified Guidance.

### 5.11.2 Pooled Variance T-test

The pooled variance t-test shares the underlying assumptions and requirements of Welch’s t-test but provides greater statistical power and is therefore helpful in identifying smaller differences. However, the pooled variance t-test has the added requirement that the variances of the two populations be equal; this requirement can be evaluated using box plots, or more robust methods such as Levene's test for equal variances (see Section 11.2, Unified Guidance). If these assumptions are met, the t-statistic can be calculated. Many statistical software packages offer versions of the pooled variance t-test, but most do not determine if the requirements and assumptions are met.

As with Welch’s t-test, the calculated t-value is compared to a critical t-value, which is based on the selected significance level of the test and on the number of degrees of freedom. If the calculated t-value is less than or equal to the critical value, then no evidence exists of a statistically significant difference between the two population means at the specified confidence level. The equations for the necessary calculations, including the critical t-values for common significance levels, can be found in most statistical texts and in the Unified Guidance.
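A minimal sketch of the pooled variance calculation, assuming the equal-variance requirement has already been checked separately: the two sample variances are combined into one weighted estimate, and the degrees of freedom are the combined sample size minus 2. The function name and data values are hypothetical.

```python
import math
import statistics


def pooled_variance_t(background, compliance):
    """Return (t statistic, degrees of freedom) for the pooled variance t-test."""
    n_b, n_c = len(background), len(compliance)
    df = n_b + n_c - 2
    # Weighted average of the two sample variances (assumes equal population variances).
    pooled_var = (
        (n_b - 1) * statistics.variance(background)
        + (n_c - 1) * statistics.variance(compliance)
    ) / df
    diff = statistics.mean(compliance) - statistics.mean(background)
    t_stat = diff / math.sqrt(pooled_var * (1 / n_b + 1 / n_c))
    return t_stat, df


# Hypothetical data sets with equal sample variances.
background = [1.0, 2.0, 3.0, 4.0, 5.0]
compliance = [3.0, 4.0, 5.0, 6.0, 7.0]
t_stat, df = pooled_variance_t(background, compliance)
# Compare t_stat to the tabulated critical t-value for df at the chosen significance level.
```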

- Study Question 2: Are concentrations greater than background concentrations?
- Study Question 5: Is there a trend in contaminant concentrations?

This test assumes that data are normally distributed. If this assumption cannot be met, consider using other parametric or nonparametric two-sample tests such as those discussed in this section.

- No naturally-occurring spatial variability can be present.
- This method requires spatially and temporally independent samples.
- No temporal trends can be present in the data.
- Population variances must be equal.
- If you suspect outliers, examine the data using a probability plot, Dixon’s test, Rosner’s test, or another appropriate method.
- See Section 5.7 for information regarding the handling of nondetects.
- Use of 8 to 10 measurements is recommended; a larger data set may be required if the data are skewed or contain nondetects.

- This method is relatively simple to implement and interpret (when assumptions are met).
- Use on lognormal data which are transformed is not recommended.

Additional information on the pooled variance t-test, including examples of how to perform the test, can be found in Chapter 16.1.1, Unified Guidance.

### 5.11.3 Wilcoxon Rank-sum Test

The Wilcoxon rank-sum test is a nonparametric two-sample test that may be used to compare two populations when the groundwater data are not normally-distributed and cannot be normalized by transformation. The Wilcoxon rank-sum test is equivalent to the Mann-Whitney U-test. Requirements for the Wilcoxon rank-sum test include the assumption of equal variances, the assumption of a common (unknown) distribution, a lack of spatial variability, and temporal stability. The Wilcoxon rank-sum test can handle data sets with a limited number of nondetects (10-15%) with uniform reporting limits.

As the name implies, the Wilcoxon rank-sum test is performed by ordering the combined data from smallest to largest and ranking the values from 1 to N. Tied values receive a midrank which is the average of the ranks they would receive were they not tied. The resulting numerical ranks of the background samples are denoted as B_{i} and the compliance samples are C_{i}. The Wilcoxon statistic (W) is computed as the sum of the compliance ranks and the result is standardized to compute a Z-score for comparison to a tabulated critical statistic. Calculations for W, the expected value E(W), standard deviation SD(W), and the test statistic Z, for data with no ties are available in most statistical references and the Unified Guidance.

A computed Z greater than the tabulated critical Z at the selected significance level indicates that the compliance well concentrations are statistically different from the background at that significance level.
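The ranking and standardization steps described above can be sketched as follows. This is an illustration under stated assumptions: it uses the no-ties formulas for E(W) and SD(W), it applies midranks to tied values but omits any tie or continuity correction (the Unified Guidance presentation may include such corrections), and the function names and data values are hypothetical. The one-sided critical Z is taken from the standard normal distribution.

```python
import math
from statistics import NormalDist


def midranks(values):
    """Rank values from 1 to N; tied values share the average of their ranks."""
    ordered = sorted(values)
    # A value first found at 0-based index f with c copies occupies ranks f+1 .. f+c.
    return [(2 * ordered.index(v) + ordered.count(v) + 1) / 2 for v in values]


def wilcoxon_rank_sum(background, compliance):
    """Return (W, Z): the sum of compliance ranks and its standardized score."""
    n_b, n_c = len(background), len(compliance)
    n_total = n_b + n_c
    ranks = midranks(list(background) + list(compliance))
    w = sum(ranks[n_b:])                          # ranks of the compliance samples
    expected = n_c * (n_total + 1) / 2            # E(W), no-ties form
    sd = math.sqrt(n_b * n_c * (n_total + 1) / 12)  # SD(W), no-ties form
    return w, (w - expected) / sd


# Hypothetical data in which every compliance value exceeds every background value.
background = [1.0, 2.0, 3.0, 4.0]
compliance = [5.0, 6.0, 7.0, 8.0]
w, z = wilcoxon_rank_sum(background, compliance)

z_crit = NormalDist().inv_cdf(1 - 0.05)  # one-sided critical Z for alpha = 0.05
significant = z > z_crit
```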

The Wilcoxon rank-sum test is available in most statistical software packages as a default selection for data that are not normally distributed; however, most packages do not automatically evaluate for compliance with the necessary underlying requirements or assumptions.

- Study Question 2: Are concentrations greater than background concentrations?
- Study Question 5: Is there a trend in contaminant concentrations?

Although there is no assumption of normality, violations of the requirements listed below may invalidate the results of the test. Always verify that the data comply with the requirements.

- Equal population variances
- Common (shared) distribution between populations
- Absence of naturally-occurring spatial variability
- Samples are spatially and temporally independent
- Temporal stability
- The number of nondetects should be minimal (typically, less than 10 to 15%) and should be treated as tied data.
- Use of 8 to 10 measurements is recommended; a larger data set may be required if the data are skewed or contain nondetects.

- This test has no requirement for normality.
- This test can accommodate nondetects, but a large number of nondetects may decrease the usefulness of the result.

Additional information on the Wilcoxon Rank-Sum test including examples of how to perform the test can be found in Chapter 16.2, Unified Guidance.

### 5.11.4 Sign or Signed Rank Test

The signed rank test is used to evaluate differences between groups of “paired” data such as analytical results from a group of wells before and after remediation efforts. The signed rank test evaluates whether a statistically significant difference exists between the medians of two groups by evaluating the difference between each pair of observations. The pairs are ranked in ascending order of the absolute value of their difference, and each rank is multiplied by the sign of the paired difference. The sum of those products is the test statistic W, which is compared to a tabulated critical value that is based on the selected statistical significance of the test and the number of sample pairs (differences). A computed test statistic W greater than the tabulated critical W at the selected significance level indicates that the two groups of data are statistically different at the selected significance level. The signed rank test is available in some statistical software packages and is relatively straightforward to implement in spreadsheet software.
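The computation of W described above can be sketched as follows. The function names and the paired values are hypothetical. Dropping pairs with a zero difference and assigning midranks to tied absolute differences are common conventions assumed here, not steps stated in this section. The critical value is not computed; it is read from a table for the returned number of pairs.

```python
import math


def midranks(values):
    """Rank values from 1 to N; tied values share the average of their ranks."""
    ordered = sorted(values)
    # A value first found at 0-based index f with c copies occupies ranks f+1 .. f+c.
    return [(2 * ordered.index(v) + ordered.count(v) + 1) / 2 for v in values]


def signed_rank_statistic(before, after):
    """Return (W, number of nonzero pairs) for paired observations."""
    # Assumption: pairs with no difference are dropped before ranking.
    diffs = [a - b for b, a in zip(before, after) if a != b]
    ranks = midranks([abs(d) for d in diffs])
    # Each rank carries the sign of its paired difference; W is the sum.
    w = sum(math.copysign(r, d) for r, d in zip(ranks, diffs))
    return w, len(diffs)


# Hypothetical concentrations at five wells before and after remediation.
before = [10, 12, 9, 14, 11]
after = [8, 9, 10, 9, 7]
w, n_pairs = signed_rank_statistic(before, after)
# Compare w to the tabulated critical value for n_pairs at the chosen significance level.
```

In this hypothetical data set four of the five differences are negative, so W is strongly negative, which is consistent with lower concentrations after remediation.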

Study Question 5: Is there a trend in contaminant concentrations?

- Use of 8 to 10 measurements is recommended; a larger data set may be required if the data are skewed or contain nondetects.
- See Section 5.7 for information on handling of nondetect data.

- Data are paired and come from a common population.
- Each pair is independent of the other pairs.

- This test has no requirement for normality.
- This test is not designed to accommodate nondetects.

Additional information on the signed-rank test, including examples, can be found in a variety of statistical texts and guidance documents (Gilbert 1987).

### 5.11.5 Tarone-Ware Two-sample Test for Censored Data

The Tarone-Ware two-sample test provides the added versatility of dealing with nondetect data. Like other nonparametric tests, Tarone-Ware assumes identical distribution of background and compliance populations, and requires equal variances. As with the other tests, the Tarone-Ware two-sample test also requires temporal stability and lack of spatial variability. To perform this test, the two data sets (for example, background and compliance data) are combined and the distinct (unique) detect values are ordered from lowest to highest. The number of values (including nondetects) less than or equal to each ordered value is computed for compliance, background, and combined data. The Tarone-Ware statistic is then calculated using equations found in some statistical references, including the Unified Guidance. Variations of this test (such as Gehan’s (1965) generalized Wilcoxon test) are also found in some statistical software packages, although compliance with the underlying assumptions and requirements is generally not automatically evaluated.

A computed Tarone-Ware statistic (TW) greater than the tabulated critical value at the selected significance level indicates, for the example above of comparing background and compliance data, that the compliance well concentrations are statistically greater than the background at that significance level.

- Study Question 2: Are concentrations greater than background concentrations?
- Study Question 5: Is there a trend in contaminant concentrations?

- Equal population variances
- Samples are spatially and temporally independent
- Temporal stability

- Use of 8 to 10 measurements is recommended; a larger data set may be required if the data are skewed or contain nondetects.
- Although the method does not require normality, significant deviations from the requirements listed can invalidate results.
- Although this test is robust with respect to the presence of nondetects, general equality of variance should be visually checked using box plots.

- This test does not require normality in the data set.
- This test addresses nondetect related limitations found in other nonparametric methods.
- This test is not as available in standard environmental statistics software packages as other nonparametric methods.

Additional information on the Tarone-Ware two-sample test, including examples of how to perform the test, can be found in Chapter 16.3, Unified Guidance.

Publication Date: December 2013