Regularization hyperparameter choice in sparsesurv

We established previously that the sparsity and performance of a distilled model can be affected significantly by the choice of teacher model class. In non-distilled sparse models, such as the Lasso, performance is also highly dependent on the degree of sparsity; in distilled models this seems to be much less the case [1, 2].

Still, there are situations in which one may wish to choose a specific sparsity level, or to err on the side of a higher or lower level of sparsity.

Baseline

[3]:
import pandas as pd
from sparsesurv.utils import transform_survival
from sklearn.decomposition import PCA
from sparsesurv._base import KDSurv
from sparsesurv.cv import KDPHElasticNetCV, KDEHMultiTaskLassoCV, KDAFTElasticNetCV
from sklearn.pipeline import make_pipeline
from sksurv.linear_model import CoxPHSurvivalAnalysis
from sparsesurv.aft import AFT
from sparsesurv.eh import EH
from sklearn.preprocessing import StandardScaler

# Load the preprocessed ovarian cancer (OV) example dataset
df = pd.read_csv("https://zenodo.org/records/10027434/files/OV_data_preprocessed.csv?download=1")
# Features and structured survival outcome (event indicator and time)
X = df.iloc[:, 3:].to_numpy()
y = transform_survival(time=df.OS_days.values, event=df.OS.values)

# Simple train/test split
X_train = X[:200]
X_test = X[200:]
y_train = y[:200]
y_test = y[200:]

pipe_cox_efron = KDSurv(
            teacher=make_pipeline(
                StandardScaler(),
                PCA(n_components=16),
                CoxPHSurvivalAnalysis(ties="efron"),
            ),
            student=make_pipeline(
                StandardScaler(),
                KDPHElasticNetCV(
                    tie_correction="efron",
                    l1_ratio=0.9,
                    eps=0.01,
                    n_alphas=100,
                    cv=5,
                    stratify_cv=True,
                    seed=None,
                    shuffle_cv=False,
                    cv_score_method="linear_predictor",
                    n_jobs=1,
                    alpha_type="min",
                ),
            ),
        )

pipe_cox_efron.fit(X_train, y_train)
[4]:
import numpy as np
np.sum(pipe_cox_efron.student[1].coef_ != 0.0)
[4]:
169
[5]:
from sksurv.metrics import concordance_index_censored
concordance_index_censored(y_test["event"], y_test["time"], pipe_cox_efron.predict(X_test))[0]
[5]:
0.5244498777506112

By default, sparsesurv fits models with no limit on the number of non-zero coefficients beyond what is implied by the regularizers. In addition, sparsesurv uses alpha_type="min" by default, thus choosing the regularization hyperparameter that maximizes the cross-validation score. The resulting number of non-zero coefficients will often be rather high (as evidenced by the 169 non-zero coefficients in the ovarian cancer example above).
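
To put the default behaviour into perspective, the retained coefficients can also be expressed as a fraction of all input features, using only the attributes already shown above:

n_features = X_train.shape[1]
n_nonzero = int(np.sum(pipe_cox_efron.student[1].coef_ != 0.0))
print(f"{n_nonzero} of {n_features} features retained ({100 * n_nonzero / n_features:.1f}%)")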

Explicitly limiting the number of non-zero coefficients

One alternative is explicitly limiting the number of non-zero coefficients. Below, we select the regularization hyperparameter with the maximum score among those yielding 50 or fewer non-zero coefficients.

[6]:
pipe_cox_efron = KDSurv(
            teacher=make_pipeline(
                StandardScaler(),
                PCA(n_components=16),
                CoxPHSurvivalAnalysis(ties="efron"),
            ),
            student=make_pipeline(
                StandardScaler(),
                KDPHElasticNetCV(
                    tie_correction="efron",
                    l1_ratio=0.9,
                    eps=0.01,
                    n_alphas=100,
                    cv=5,
                    stratify_cv=True,
                    seed=None,
                    shuffle_cv=False,
                    cv_score_method="linear_predictor",
                    n_jobs=1,
                    alpha_type="min",
                    max_coef=50
                ),
            ),
        )

pipe_cox_efron.fit(X_train, y_train)
[7]:
np.sum(pipe_cox_efron.student[1].coef_ != 0.0)
[7]:
47
[8]:
concordance_index_censored(y_test["event"], y_test["time"], pipe_cox_efron.predict(X_test))[0]
[8]:
0.5138549307253464

While explicitly setting the desired degree of sparsity can work well, one may instead want the degree of sparsity to be chosen automatically, such that it strikes a good trade-off between performance and sparsity.

Automatically trading off sparsity and performance

For this purpose, sparsesurv implements two alternative rules instead of choosing the regularization hyperparameter that maximizes the score:

1. alpha_type="1se" chooses the highest regularization hyperparameter whose mean cross-validated score is within one standard error of the best score [3] (see the sketch after this note).

2. alpha_type="pcvl" chooses a regularization hyperparameter that yields a model less sparse than "1se" but more sparse than "min", via a penalization approach [4].

Importantly, alpha_type="1se" requires cv_score_method != "linear_predictor": the linear predictor method yields only a single score across all folds, so a mean score and its standard error cannot be computed.
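
To illustrate what the "1se" rule does, the snippet below applies it to hypothetical per-fold scores (higher is better). This is a minimal sketch of the selection rule itself, not sparsesurv's internal implementation, and all numbers are made up for illustration:

alphas = np.array([1.0, 0.5, 0.1, 0.05, 0.01])  # regularization strengths, strongest first
fold_scores = np.array(
    [  # hypothetical 5-fold CV scores, one row per alpha
        [0.50, 0.52, 0.49, 0.51, 0.50],
        [0.61, 0.58, 0.63, 0.59, 0.61],
        [0.62, 0.58, 0.65, 0.59, 0.62],
        [0.61, 0.57, 0.63, 0.58, 0.60],
        [0.59, 0.55, 0.62, 0.56, 0.60],
    ]
)
mean_score = fold_scores.mean(axis=1)
std_err = fold_scores.std(axis=1, ddof=1) / np.sqrt(fold_scores.shape[1])
best = mean_score.argmax()  # the "min"-style choice: the alpha with the best mean score
threshold = mean_score[best] - std_err[best]
# "1se": the strongest regularization whose mean score is still within one SE of the best
alpha_1se = alphas[np.where(mean_score >= threshold)[0][0]]
print(alphas[best], alpha_1se)  # prints 0.1 (best) and 0.5 (1se) for these made-up scores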

[13]:
pipe_cox_efron = KDSurv(
            teacher=make_pipeline(
                StandardScaler(),
                PCA(n_components=16),
                CoxPHSurvivalAnalysis(ties="efron"),
            ),
            student=make_pipeline(
                StandardScaler(),
                KDPHElasticNetCV(
                    tie_correction="efron",
                    l1_ratio=0.9,
                    eps=0.01,
                    n_alphas=100,
                    cv=5,
                    stratify_cv=True,
                    seed=None,
                    shuffle_cv=False,
                    cv_score_method="vvh",
                    n_jobs=1,
                    alpha_type="1se"
                ),
            ),
        )

pipe_cox_efron.fit(X_train, y_train)
[14]:
np.sum(pipe_cox_efron.student[1].coef_ != 0.0)
[14]:
0
[15]:
concordance_index_censored(y_test["event"], y_test["time"], pipe_cox_efron.predict(X_test))[0]
[15]:
0.5
[16]:
pipe_cox_efron = KDSurv(
            teacher=make_pipeline(
                StandardScaler(),
                PCA(n_components=16),
                CoxPHSurvivalAnalysis(ties="efron"),
            ),
            student=make_pipeline(
                StandardScaler(),
                KDPHElasticNetCV(
                    tie_correction="efron",
                    l1_ratio=0.9,
                    eps=0.01,
                    n_alphas=100,
                    cv=5,
                    stratify_cv=True,
                    seed=None,
                    shuffle_cv=False,
                    cv_score_method="linear_predictor",
                    n_jobs=1,
                    alpha_type="pcvl"
                ),
            ),
        )

pipe_cox_efron.fit(X_train, y_train)
[17]:
np.sum(pipe_cox_efron.student[1].coef_ != 0.0)
[17]:
51
[18]:
concordance_index_censored(y_test["event"], y_test["time"], pipe_cox_efron.predict(X_test))[0]
[18]:
0.5142624286878565

As seen above, the downside of automatic selection methods is that they may select completely sparse models (all coefficients set to zero) if the predictive performance is not much better than chance, as was the case for the "1se" rule here.
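
If a completely sparse model is undesirable, one pragmatic guard (a sketch reusing only the constructors shown above, not an official sparsesurv recommendation) is to check whether the fitted student is empty and, if so, refit with the "min" rule plus an explicit coefficient cap:

if np.all(pipe_cox_efron.student[1].coef_ == 0.0):
    # Fall back to alpha_type="min" with an explicit cap on non-zero coefficients.
    pipe_cox_efron = KDSurv(
        teacher=make_pipeline(
            StandardScaler(),
            PCA(n_components=16),
            CoxPHSurvivalAnalysis(ties="efron"),
        ),
        student=make_pipeline(
            StandardScaler(),
            KDPHElasticNetCV(
                tie_correction="efron",
                l1_ratio=0.9,
                eps=0.01,
                n_alphas=100,
                cv=5,
                stratify_cv=True,
                seed=None,
                shuffle_cv=False,
                cv_score_method="linear_predictor",
                n_jobs=1,
                alpha_type="min",
                max_coef=50,
            ),
        ),
    )
    pipe_cox_efron.fit(X_train, y_train)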

References

[1] David Wissel, Nikita Janakarajan, Daniel Rowson, Julius Schulte, Xintian Yuan, Valentina Boeva. “sparsesurv: Sparse survival models via knowledge distillation.” (2023, under review).

[2] Paul, Debashis, et al. ““Preconditioning” for feature selection and regression in high-dimensional problems.” The Annals of Statistics 36.4 (2008): 1595-1618.

[3] Hastie, Trevor, et al. The elements of statistical learning: data mining, inference, and prediction. Vol. 2. New York: Springer, 2009.

[4] Ternès, Nils, Federico Rotolo, and Stefan Michiels. “Empirical extensions of the lasso penalty to reduce the false discovery rate in high‐dimensional Cox regression models.” Statistics in medicine 35.15 (2016): 2561-2573.