Edited by: Rodney L. Honeycutt, Pepperdine University, United States

Reviewed by: Carlos Garcia-Verdugo, Jardín Botánico Canario “Viera y Clavijo”, Spain; Michael S. Brewer, University of California, Berkeley, United States

This article was submitted to Phylogenetics, Phylogenomics, and Systematics, a section of the journal Frontiers in Ecology and Evolution

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

Selective inference is considered for testing trees and edges in phylogenetic tree selection from molecular sequences. This improves the previously proposed approximately unbiased test by adjusting for the selection bias that arises when many trees and edges are tested at the same time. The newly proposed selective inference

A phylogenetic tree is a diagram showing evolutionary relationships among species, and a tree topology is a graph obtained from the phylogenetic tree by ignoring the branch lengths. The primary objective of any phylogenetic analysis is to estimate a topology that reflects the evolutionary history of the group of organisms under study. Branches of the tree are also referred to as edges in the tree topology. Given a rooted tree topology, or an unrooted tree topology with an outgroup, each edge splits the tree so that it defines the clade consisting of all the descendant species. Therefore, edges in a tree topology represent clades of species. Because the phylogenetic tree is commonly inferred from molecular sequences, it is crucial to assess the statistical confidence of the inference. In phylogenetics, it is a common practice to compute confidence levels for tree topologies and edges. For example, the bootstrap probability (Felsenstein,

For illustrating the idea of selective inference, we first look at a simple example of a 1-dimensional normal random variable

Observing this variable, we test the null hypothesis H_{0} : θ ≤ 0 against the alternative hypothesis H_{1} : θ > 0. We denote the cumulative distribution function of

What happens when we test many hypotheses at the same time? Consider random variables Z_{i} ~ N(θ_{i}, 1), i = 1, …, K_{all}, not necessarily independent, with null hypotheses θ_{i} ≤ 0, of which K_{true} hypotheses are actually true. To control the number of falsely rejected true hypotheses, there are several multiplicity-adjusted approaches, such as the family-wise error rate (FWER) and the false discovery rate (FDR). Instead of testing all the K_{all} hypotheses, selective inference (SI) tests only the K_{select} hypotheses with z_{i} > c_{i} for constants c_{i} specified in advance. This kind of selection is very common in practice (e.g., publication bias), and it is called the

where K_{select} ≪ K_{all}.
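The idea in this 1-dimensional setting can be sketched in a few lines of Python. The function names and the threshold `c` are ours for illustration; with selection at c = 0, conditioning on the selection event doubles the ordinary one-sided p-value:

```python
from math import erf, sqrt

def phi_bar(x: float) -> float:
    """Upper tail probability 1 - Phi(x) of the standard normal."""
    return 0.5 * (1.0 - erf(x / sqrt(2.0)))

def p_naive(z: float) -> float:
    """Ordinary one-sided p-value for H0: theta <= 0."""
    return phi_bar(z)

def p_selective(z: float, c: float = 0.0) -> float:
    """p-value conditioned on the selection event Z > c:
    P(Z > z | Z > c) evaluated at theta = 0."""
    return phi_bar(z) / phi_bar(c)

# A hypothesis that looks significant before the adjustment
# (p ~ 0.05) is no longer significant after it (p ~ 0.10).
z = 1.645
print(p_naive(z))      # about 0.05
print(p_selective(z))  # about 0.10
```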

Selective inference has been mostly developed for inferences after model selection (Taylor and Tibshirani,

Confidence intervals of regression coefficients for selected variables by lasso; see section 6.8 for details. All intervals are computed for confidence level 1 − α at α = 0.01. (Black) the ordinary confidence interval

Phylogenetic tree selection is an example of model selection. Since each tree can be specified as a combination of edges, tree selection can be interpreted as the variable selection of multiple regression, where edges correspond to the predictors of regression (Shimodaira,

For illustrating phylogenetic inference methods, we analyze a dataset consisting of mitochondrial protein sequences of six mammalian species with

The number of unrooted trees for six taxa is 105. These trees are reordered by their likelihood values and labeled as T1, T2, …, T105. T1 is the ML tree as shown in
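The count of 105 follows from the double-factorial formula for unrooted binary topologies, (2n − 5)!! for n taxa; a quick check:

```python
def num_unrooted_topologies(n_taxa: int) -> int:
    """Number of unrooted binary tree topologies on n_taxa leaves:
    (2n - 5)!! = 1 * 3 * 5 * ... * (2n - 5), for n_taxa >= 3."""
    count = 1
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count

print(num_unrooted_topologies(6))  # 105, matching the text
```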

Examples of two unrooted trees T1 and T7. Branch lengths represent ML estimates of parameters (expected number of substitutions per site). T1 includes edges E1, E2, and E3 and T7 includes E1, E6, and E8.

The result of phylogenetic analysis is summarized in

Three types of p-values (BP, AU, SI) and geometric quantities (β_{0}, β_{1}) for the best 20 trees.

tree | BP | AU | SI | β_{0} | β_{1} | topology | edges
---|---|---|---|---|---|---|---
T1^{†} | 0.559 (0.001) | 0.752 (0.001) | 0.372 (0.001) | -0.41 (0.00) | 0.27 (0.00) | (((1(23))4)56) | E1, E2, E3
T2 | 0.304 (0.000) | 0.467 (0.001) | 0.798 (0.001) | 0.30 (0.00) | 0.22 (0.00) | ((1((23)4))56) | E1, E2, E4
T3 | | 0.126 (0.002) | 0.202 (0.003) | 1.46 (0.01) | 0.32 (0.00) | (((14)(23))56) | E1, E2, E5
T4 | | 0.081 (0.002) | 0.124 (0.003) | 1.79 (0.01) | 0.40 (0.01) | ((1(23))(45)6) | E1, E3, E6
T5 | | 0.127 (0.002) | 0.199 (0.003) | 1.50 (0.01) | 0.36 (0.00) | (1((23)(45))6) | E1, E6, E7
T6 | | | 0.050 (0.002) | 2.21 (0.02) | 0.35 (0.01) | (1(((23)4)5)6) | E1, E4, E7
T7^{‡} | | 0.100 (0.003) | 0.150 (0.003) | 1.72 (0.01) | 0.44 (0.01) | ((1(45))(23)6) | E1, E6, E8
T8 | | | | 2.74 (0.03) | 0.43 (0.02) | ((15)((23)4)6) | E1, E4, E9
T9 | | | | 3.67 (0.09) | 0.46 (0.04) | (((1(23))5)46) | E1, E3, E10
T10 | | | | 2.43 (0.02) | 0.42 (0.01) | (((15)4)(23)6) | E1, E8, E9
T11 | | | | 3.14 (0.07) | 0.51 (0.03) | (((14)5)(23)6) | E1, E5, E8
T12 | | | | 3.78 (0.09) | 0.41 (0.04) | (((15)(23))46) | E1, E9, E10
T13 | | | | 3.96 (0.19) | 0.54 (0.09) | (1(((23)5)4)6) | E1, E7, E11
T14 | | | | 4.66 (0.31) | 0.65 (0.12) | ((14)((23)5)6) | E1, E5, E11
T15 | | | | 5.28 (0.34) | 0.43 (0.11) | ((1((23)5))46) | E1, E10, E11
T16 | | | | 3.63 (0.04) | 0.23 (0.01) | ((((13)2)4)56) | E2, E3, E12
T17 | | | | 3.81 (0.04) | 0.22 (0.01) | ((((12)3)4)56) | E2, E3, E13
T18 | | | | 4.33 (0.10) | 0.34 (0.03) | (((13)2)(45)6) | E3, E6, E12
T19 | | | | 4.36 (0.11) | 0.32 (0.04) | (((12)3)(45)6) | E3, E6, E13
T20 | | | | 3.90 (0.12) | 0.44 (0.05) | (((1(45))2)36) | E6, E8, E14

Three types of p-values (BP, AU, SI) and geometric quantities (β_{0}, β_{1}) for all the 25 edges of six taxa.

edge | BP | AU | SI | β_{0} | β_{1}
---|---|---|---|---|---
E1^{†}^{‡} | | | | -3.87 (0.03) | 0.16 (0.01)
E2^{†} | 0.930 (0.000) | | 0.903 (0.001) | -1.59 (0.00) | 0.12 (0.00)
E3^{†} | 0.580 (0.001) | 0.719 (0.001) | 0.338 (0.001) | -0.39 (0.00) | 0.19 (0.00)
E4 | 0.318 (0.000) | 0.435 (0.001) | 0.775 (0.001) | 0.32 (0.00) | 0.16 (0.00)
E5 | | 0.124 (0.002) | 0.198 (0.002) | 1.47 (0.01) | 0.32 (0.00)
E6^{‡} | 0.060 (0.000) | 0.074 (0.001) | 0.141 (0.002) | 1.50 (0.00) | 0.05 (0.00)
E7 | | 0.091 (0.002) | 0.154 (0.002) | 1.56 (0.01) | 0.22 (0.00)
E8^{‡} | | 0.068 (0.002) | 0.110 (0.003) | 1.80 (0.01) | 0.31 (0.01)
E9 | | | | 2.48 (0.02) | 0.27 (0.02)
E10 | | | | 3.72 (0.07) | 0.29 (0.03)
E11 | | | | 4.31 (0.10) | 0.35 (0.03)
E12 | | | | 3.68 (0.05) | 0.17 (0.02)
E13 | | | | 3.90 (0.04) | 0.15 (0.02)
E14 | | | | 4.03 (0.09) | 0.30 (0.04)
E15 | | | | 4.03 (0.13) | 0.38 (0.06)
E16 | | | | 4.44 (0.05) | 0.12 (0.01)
E17 | | | | 4.70 (0.07) | 0.19 (0.02)
E18 | | | | 3.94 (0.09) | 0.26 (0.04)
E19 | | | | 5.23 (0.43) | 0.57 (0.13)
E20 | | | | 5.66 (0.29) | 0.28 (0.09)
E21 | | | | 6.38 (0.33) | 0.24 (0.08)
E22 | | | | 5.62 (0.21) | 0.17 (0.07)
E23 | | | | 4.86 (0.43) | 0.70 (0.13)
E24 | | | | 5.61 (0.17) | 0.23 (0.04)
E25 | | | | 6.32 (0.71) | 0.52 (0.20)

For developing the geometric theory in sections 3 and 4, we formulate tree selection as an instance of the mathematical setting known as the problem of regions.

Model map: Visualization of ML estimates of probability distributions for the best 15 trees. The origin represents the star-shaped tree topology (obtained by reducing the internal branches to zero length). Sites of amino acid sequences

In

For developing the theory, we consider an (m+1)-dimensional multivariate normal random vector with mean vector in ℝ^{m+1} and the identity variance matrix I_{m+1}:

A region of interest, such as a tree or an edge, is denoted as a region in ℝ^{m+1}; there are K_{all} regions in total, and we simply write

The problem of regions is well described by geometric quantities (

Let β_{0} be the signed distance to the boundary of the region; a larger β_{0} indicates stronger evidence for rejecting the region. The mean curvature of the boundary is β_{1} ∈ ℝ, defined in (30).

Problem of regions. β_{0} > 0 when the observation is outside the region, and β_{0} ≤ 0 when it is inside.

Geometric quantities β_{0} and β_{1} of regions for trees (T1, …, T105) and edges (E1, …, E25) are plotted in _{0} and β_{1} using the non-parametric bootstrap probabilities (section 6.1) with bootstrap replicates

Geometric quantities of regions (β_{0} and β_{1}) for trees and edges are estimated by the multiscale bootstrap method (section 3.4). The three types of p-values are functions of β_{0} and β_{1}, and their contour lines are drawn at

For simulating (4) from ^{*} from the bootstrap distribution (

and define the bootstrap probability (BP) of the region as the probability of the bootstrap replicate y^{*} being included in the region

N_{m+1}(y, I_{m+1}) is identical to the distribution of y^{*} in (5). An interesting consequence of the geometric theory of Efron and Tibshirani (

where ≃ indicates the

For understanding the formula (7), first assume that β_{1} = 0. Since we only have to look at the axis orthogonal to the boundary surface, the problem reduces to the 1-dimensional case with signed distance β_{0} and a bootstrap replicate distributed around it. When β_{1} > 0, the boundary bends so that less bootstrap probability falls inside the region than when β_{1} = 0, and BP becomes smaller.

BP of

The last expression also implies that the signed distance and the mean curvature of the complement region are −β_{0} and −β_{1}, respectively; this relation is also obtained by reversing the sign of

Although _{0}(_{0}(_{0}} in which the signed distance is larger than the observed value β_{0} = β_{0}(

where the probability is calculated for _{0}(_{0}} is very similar to the shape of _{0} (shown as

where the last expression is obtained by substituting (−β_{0}, β_{1}) for (β_{0}, β_{1}) in (8). This formula computes AU from (β_{0}, β_{1}). An intuitive interpretation of (10) is explained in section 6.4.
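Reading formulas (7) and (10) as BP ≃ 1 − Φ(β_{0} + β_{1}) and AU ≃ 1 − Φ(β_{0} − β_{1}), the tabulated values can be reproduced numerically. This is our reading of the equations, so treat it as a sketch; we check it against tree T2 of the table of the best 20 trees:

```python
from math import erf, sqrt

def phi_bar(x: float) -> float:
    """Upper tail probability 1 - Phi(x) of the standard normal."""
    return 0.5 * (1.0 - erf(x / sqrt(2.0)))

def bp(beta0: float, beta1: float) -> float:
    """Bootstrap probability via (7): BP ~ phi_bar(beta0 + beta1)."""
    return phi_bar(beta0 + beta1)

def au(beta0: float, beta1: float) -> float:
    """Approximately unbiased p-value via (10): AU ~ phi_bar(beta0 - beta1)."""
    return phi_bar(beta0 - beta1)

# Tree T2 has (beta0, beta1) = (0.30, 0.22) with tabulated
# BP = 0.304 and AU = 0.467; the formulas come close.
print(bp(0.30, 0.22))  # about 0.30
print(au(0.30, 0.22))  # about 0.47
```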

In non-selective inference,

and the rejection probability increases as

Exchanging the roles of the region and its complement, we substitute (−β_{0}, −β_{1}), i.e., the geometric quantities of the complement region, for (β_{0}, β_{1}) in (10) as

If

In order to estimate β_{0} and β_{1} from bootstrap probabilities, we consider a generalization of (5) as

for a variance σ^{2} > 0, and define multiscale bootstrap probability of

where

Although our theory is based on the multivariate normal model, the actual implementation of the algorithm uses the non-parametric bootstrap probabilities in section 6.1. To fill the gap between the two models, we consider a non-linear transformation f_{n} so that the multivariate normal model holds at least approximately for the transformed data; a concrete form of f_{n} is given in (25) for phylogenetic inference. Surprisingly, a specification of f_{n} is not required; it suffices to set σ^{2} =

For estimating β_{0} and β_{1}, we need a scaling law which explains how the bootstrap probability depends on the scale. Rescaling the space by σ^{−1} recovers unit variance σ^{2} = 1, which amounts to signed distance β_{0}/σ and mean curvature β_{1}σ (Shimodaira, ). Substituting them for (β_{0}, β_{1}) in (7), we obtain

For better illustrating how the bootstrap probability varies with σ^{2}, we define

We can estimate β_{0} and β_{1} as regression coefficients by fitting the linear model (16) in terms of σ^{2} to the observed values of non-parametric bootstrap probabilities. Setting σ^{2} = −1 in the last expression of (16) gives AU, meaning that AU corresponds to extrapolation to σ^{2} = −1. Although σ^{2} should be positive in (15), we can think of negative σ^{2} in the extrapolation.
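The fit of (16) can be sketched as an ordinary least-squares regression on σ^{2}, taking ψ = σ · Φ̄^{−1}(BP_{σ}) so that ψ = β_{0} + β_{1}σ^{2} under the scaling law. Here we generate multiscale bootstrap probabilities from the scaling law itself with known (β_{0}, β_{1}) and recover them — a synthetic check, not the paper's implementation:

```python
from math import erf, sqrt

def phi_bar(x):
    """Upper tail probability 1 - Phi(x) of the standard normal."""
    return 0.5 * (1.0 - erf(x / sqrt(2.0)))

def phi_bar_inv(p, lo=-10.0, hi=10.0):
    """Invert the decreasing function phi_bar by bisection."""
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if phi_bar(mid) > p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def fit_beta(scales, bps):
    """Least-squares fit of psi(s^2) = beta0 + beta1 * s^2, where
    psi = sigma * phi_bar_inv(BP_sigma); returns (beta0, beta1)."""
    xs = [s * s for s in scales]
    ys = [s * phi_bar_inv(p) for s, p in zip(scales, bps)]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    beta1 = sxy / sxx
    return my - beta1 * mx, beta1

# Generate BP_sigma from the scaling law with known geometry, then recover it.
b0_true, b1_true = 0.30, 0.22
scales = [1/3, 1/2, 1.0, 2.0, 3.0]
bps = [phi_bar(b0_true / s + b1_true * s) for s in scales]
b0, b1 = fit_beta(scales, bps)
print(b0, b1)  # recovers (0.30, 0.22); extrapolating to sigma^2 = -1 gives AU
```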

Multiscale bootstrap: β_{0} and β_{1} are estimated as the intercept and the slope, respectively. See section 6.5 for details.

In order to argue selective inference for the problem of regions, we have to specify the selection event. Let us consider a selective test of the region: the case where the observation falls outside the region (β_{0} > 0) is called the outside mode, and the case where it falls inside (β_{0} ≤ 0) is called the inside mode.

Let us consider the outside mode by assuming that β_{0} > 0. Recalling that

From the definition, the selection event is β_{0} > 0. The selective p-value is computed from (β_{0}, β_{1}) by

where _{1}) for (β_{0}, β_{1}) in (8). An intuitive justification of (18) is explained in section 6.4.

For the outside mode of selective inference,

and the rejection probability increases as

Now we consider the inside mode by assuming that β_{0} ≤ 0. SI of

For the inside mode of selective inference,

so that SI′ > 1 − α implies

We can compute SI from BP and AU. This will be useful for reanalyzing the results of previously published studies. Let us write

We can compute SI from β_{0} and β_{1} by (18) or (20). More directly, we may compute
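Under our reading of the outside-mode formula (18), SI = AU/Φ̄(−β_{1}), and β_{1} can be recovered from BP and AU alone by inverting (7) and (10). This is a sketch under those assumptions; it reproduces the tabulated SI of tree T2:

```python
from math import erf, sqrt

def phi_bar(x):
    """Upper tail probability 1 - Phi(x) of the standard normal."""
    return 0.5 * (1.0 - erf(x / sqrt(2.0)))

def phi_bar_inv(p, lo=-10.0, hi=10.0):
    """Invert the decreasing function phi_bar by bisection."""
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if phi_bar(mid) > p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def si_from_bp_au(bp, au):
    """Outside-mode selective p-value from BP and AU alone:
    invert BP = phi_bar(b0 + b1) and AU = phi_bar(b0 - b1) for b1,
    then apply SI = AU / phi_bar(-b1)."""
    beta1 = (phi_bar_inv(bp) - phi_bar_inv(au)) / 2.0
    return au / phi_bar(-beta1)

# Tree T2: BP = 0.304 and AU = 0.467; the tabulated SI is 0.798.
print(si_from_bp_au(0.304, 0.467))  # about 0.80
```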

In this section, the analytical procedure outlined in section 2 is used to determine relationships among human, mouse, and rabbit. The question is: Which of mouse or human is closer to rabbit? The traditional view (Novacek,

The results are shown below for the two test modes (inside and outside) as defined in section 4.1. The extent of multiplicity and selection bias depends on the number of regions under consideration, thus these numbers are considered for interpreting the results. The numbers of regions related to trees and edges are summarized in

The number of regions for trees and edges. The number of taxa is six.

 | trees, inside mode | edges, inside mode | trees, outside mode | edges, outside mode
---|---|---|---|---
K_{select} | 1 | 3 | 104 | 22
K_{true} | 104 | 22 | 1 | 3
K_{all} | 105 | 25 | 105 | 25

In inside mode, the null hypothesis is tested with the selection event that the observation is inside the region (β_{0} ≤ 0). This applies to the regions for T1, E1, E2, and E3, and they are the selected hypotheses. When the null hypothesis H_{0} is rejected by a test procedure, it is claimed that the alternative H_{1} holds true. For convenience, the null hypothesis H_{0} is phrased like "E1 is not true," and the alternative hypothesis H_{1} like "E1 is true"; then rejection of H_{0} implies that E1 is true. This procedure looks unusual, but it makes sense here: selection bias can be severe in the sense that K_{select}/K_{all} ≈ 0 for many taxa, and non-selective tests may lead to many false positives because K_{true}/K_{all} ≈ 1. Therefore selective inference should be used in inside mode.

In outside mode, the null hypothesis is tested with the selection event that the observation is outside the region (β_{0} > 0). This applies to the regions for T2, …, T105 and E4, …, E25, and they are the selected hypotheses. When the null hypothesis H_{0} is rejected by a test procedure, it is claimed that the tree or edge is not true; for example, rejection of H_{0} implies that T9 is not true. This is more or less a typical test procedure. Note that selection bias is minor in the sense that K_{select}/K_{all} ≈ 1 for many taxa, and non-selective tests may result in few false positives because K_{true}/K_{all} ≈ 0. Therefore selective inference is not much beneficial in outside mode.

In addition, the sign of β_{0} is estimated correctly for all the trees and edges. The estimated β_{1} values are all positive, indicating that the regions are convex. This is not surprising, because the regions are expressed as intersections of half-spaces, at least locally (

Now consider the case β_{1} > 0: BP is not guaranteed to control the type-I error in a frequentist sense. The two inequalities (SI, BP < AU) are verified as the relative positions of the contour lines for each β_{1}.

Next consider outside mode, where the null hypothesis H_{0} is specified in advance (Shimodaira, ). When β_{1} > 0, BP violates type-I error control, which results in overconfidence in non-rejected wrong trees. Therefore BP should be avoided in outside mode. Inequality AU < SI can be shown when K_{select}/K_{all} ≈ 1, while BP is very different from AU and SI for large β_{1}. It is arguable which of AU and SI is appropriate: AU is preferable to SI in tree selection (K_{true} = 1), because the multiplicity of testing is controlled, whereas K_{true} ≥ 1 for edge selection, and SI does not fix it either. For testing edges in outside mode, AU may be used for screening purposes with a small α value such as α/K_{true}.

We have developed a new method for computing selective inference

Non-parametric bootstrap is often time consuming because the maximum likelihood (ML) estimates must be recomputed for each bootstrap replicate. Kishino et al. ( ) proposed resampling the estimated site-wise log-likelihoods instead: x_{t} is the site-pattern of amino acids at site t, and the site-wise log-likelihoods are computed for

so that the log-likelihood value for tree T

where _{t} appears in ^{*} with

The non-parametric bootstrap probability of tree T is computed from 10^{5} bootstrap replicates. For each replicate, y^{*b} is computed by the method described above. Then we count the frequency that T

The non-parametric bootstrap probability of an edge is computed by summing BP(T
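The resampling scheme described above — reweighting site-wise log-likelihoods instead of refitting the ML estimates — can be sketched as follows. The toy log-likelihood matrix is made up for illustration:

```python
import random

def bootstrap_probabilities(site_loglik, n_boot=2000, seed=0):
    """site_loglik[tree][site] holds site-wise log-likelihoods.
    Returns the frequency with which each tree attains the highest
    total log-likelihood over bootstrap resamples of the sites."""
    rng = random.Random(seed)
    n_trees = len(site_loglik)
    n_sites = len(site_loglik[0])
    wins = [0] * n_trees
    for _ in range(n_boot):
        # resample sites with replacement, then sum per tree
        sites = [rng.randrange(n_sites) for _ in range(n_sites)]
        totals = [sum(site_loglik[t][s] for s in sites) for t in range(n_trees)]
        wins[max(range(n_trees), key=lambda t: totals[t])] += 1
    return [w / n_boot for w in wins]

site_loglik = [
    [-1.0, -1.2, -0.9, -1.1, -1.0],  # tree 1 (best at most sites)
    [-1.1, -1.1, -1.0, -1.2, -1.0],  # tree 2
    [-1.4, -1.3, -1.2, -1.3, -1.2],  # tree 3 (dominated everywhere)
]
bp = bootstrap_probabilities(site_loglik)
print(bp)  # tree 1 gets most of the probability; the values sum to 1
```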

An example of the transformation

where _{n} is the variance matrix of ^{2} =

which gives another insight into the visualization of section 6.2, where the variance can be interpreted as the divergence between the two models; see Equation (27). This approximation holds well when the two predictive distributions

For representing the probability distribution of tree T_{i}, points in ℝ^{n} will represent locations of the distributions. Let D_{KL}(p_{i}||p_{j}) be the Kullback-Leibler divergence between two such distributions. For sufficiently small divergences, the squared distance in ℝ^{n} approximates

for non-nested models (Shimodaira, _{0} is nested in _{i}, it becomes

Three versions of the visualization of probability distributions for the best 15 trees, drawn using different sets of models.

For dimensionality reduction, we have to specify the origin ^{n} and consider vectors _{i}: = _{i}−_{1}, …, _{15}), we obtain the visualization of _{i} as the axes (red arrows) of biplot in

For computing the "data point," we specify the origin for computing the vectors of the models. There is a hierarchy of models: one submodel is nested in all the other models, and the models of E1, E2, E3, for example, are submodels of that of T1 (T1 includes E1, E2, E3). By combining these non-nested models, we can reconstruct a comprehensive model in which all the other models are nested as submodels (Shimodaira, ); this is analogous to reconstructing the model β_{1}x_{1} + ··· + β_{25}x_{25} + ϵ of multiple regression from the submodels β_{1}x_{1} + ϵ, …, β_{25}x_{25} + ϵ. Thus we call it the "full model" in this paper, and the ML estimate of the full model is indicated as the data point

For the visualization of the best 15 trees, we may use only _{1}, …, _{11}, because they include E1 and two more edges from E2, …, E11. In _{1}, …, _{15} are projected to the space orthogonal to _{X}, so that the plot becomes the “top-view” of _{X} being at the origin.

In _{1}, …, _{15}) and _{X} is applied before PCA.

For expressing the shape of the region ^{m + 1} with ^{m},

where ^{m}. We can choose the coordinates so that _{0}) (i.e., _{0}), and _{i}|_{0} = 0, _{0}. The mean curvature of surface

which is interpreted as the trace of the Hessian matrix of the boundary surface. Convex regions have β_{1} ≥ 0, whereas concave regions have β_{1} ≤ 0. In particular, β_{1} = 0 when

Since the transformation involves scaling by n^{−1/2}, keeping the variance constant in (4) means that we actually magnify the space by the factor n^{1/2}, so that the boundary surface flattens at the rate n^{−1/2}.

Here we explain the problem of regions in terms of the

Ideal _{0}(_{1}(_{1}(_{1}(_{1}, we treat β_{1}(

Let us look at the

The 1-dimensional model (1) is now equivalent to (31). The null hypothesis is also equivalent:

We have used thirteen σ^{2} values from 1/9 to 9 (equally spaced in log-scale). This range is relatively large, and we observe a slight deviation from the linear model; β_{0} and β_{1} are therefore estimated from the tangent line to the fitted curve at σ^{2} = 1, with AU obtained by extrapolation to σ^{2} = −1. Shimodaira ( ) considered fitting other curves at σ^{2} = 1 as a generalization of the tangent line for improving the accuracy of AU and SI.

In the implementation, a narrower range of σ^{2} values is used (ten σ^{−2} values: 0.5, 0.6, …, 1.4). Only the linear model is fitted there, so the estimated β_{0} and β_{1} should be very close to those estimated from the tangent line described above. An advantage of using a wider range of σ^{2} is that the estimation error of β_{0} and β_{1} will become smaller.

Let

where geometric quantities β_{0}, β_{1} are defined for the regions

Here we explain why we have considered the special case of _{0} unless

where

Let us see how these two

and it is very different from SI of E2 conditioned on selecting E2

where _{0} than

The regions correspond to trees or edges. In inside and outside modes, the number of total regions is K_{all} = 105 for trees and K_{all} = 25 for edges when the number of taxa is six. In inside mode, regions with β_{0} ≤ 0 are selected: K_{select} = 1 for trees and K_{select} = 3 for edges. In outside mode, K_{select} is K_{all} minus that for inside mode: K_{select} = K_{all} − 1 = 104 for trees and K_{select} = K_{all} − 3 = 22 for edges. The null hypothesis holds true for K_{true} of the regions, and K_{true} = K_{all} − K_{select} for both cases.

Selective inference is considered for the variable selection of regression analysis. Here, we deal with prostate cancer data (Stamey et al.,

The goal is to provide valid selective inference for the partial regression coefficients of the variables selected by lasso (Tibshirani, ). By generating bootstrap replicates with scale τ > 0, we can apply the multiscale bootstrap method described in section 4 for selective inference in the regression problem. Here, we note that the target of the inference is the true partial regression coefficients:

where ^{n×p} is the design matrix. We compute four types of intervals with confidence level 1 − α = 0.95 for selected variable

Note that the selection event of variable ^{n}, and thus, according to the polyhedral lemma (Lee et al.,

We set λ = 10 as the penalty parameter of lasso, and the following model and signs were selected:
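A minimal pure-Python lasso (cyclic coordinate descent with soft-thresholding) makes the "selection event" concrete: the active set together with the signs of the nonzero coefficients. The data and penalty below are made up for illustration, not the prostate data:

```python
def soft_threshold(u, lam):
    """Soft-thresholding operator, the proximal map of the l1 penalty."""
    if u > lam:
        return u - lam
    if u < -lam:
        return u + lam
    return 0.0

def lasso(X, y, lam, n_iter=500):
    """Cyclic coordinate descent for 0.5*||y - X beta||^2 + lam*||beta||_1."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # correlation of predictor j with the partial residual
            r_j = sum(X[i][j] * (y[i] - sum(X[i][k] * beta[k]
                      for k in range(p) if k != j)) for i in range(n))
            z_j = sum(X[i][j] ** 2 for i in range(n))
            beta[j] = soft_threshold(r_j, lam) / z_j
    return beta

# Tiny synthetic example: the response depends on the first predictor only.
X = [[1.0, 0.5], [2.0, -0.2], [3.0, 0.1], [4.0, -0.4], [5.0, 0.3]]
y = [1.1, 2.0, 2.9, 4.2, 5.0]
beta = lasso(X, y, lam=5.0)
# The selection event: which variables are active, and with which signs.
active = [(j, 1 if b > 0 else -1) for j, b in enumerate(beta) if b != 0.0]
print(beta, active)  # only variable 0 is selected, with positive sign
```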

The confidence intervals are shown in

HS and YT developed the theory of selective inference. HS programmed the multiscale bootstrap software and conducted the phylogenetic analysis. YT conducted the lasso analysis. HS wrote the manuscript. All authors have approved the final version of the manuscript.

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

The authors appreciate the feedback from the audience of seminar talk of HS at Department of Statistics, Stanford University. The authors are grateful to Masami Hasegawa for his insightful comments on phylogenetic analysis of mammal species.