{ "cells": [ { "cell_type": "markdown", "id": "9e4d2938", "metadata": {}, "source": [ "# Regularization hyperparameter choice in `sparsesurv`" ] }, { "cell_type": "markdown", "id": "24658b42", "metadata": {}, "source": [ "We established before that sparsity and performance of a distilled model can be affected significantly by the choice of the teacher model class. In non-distilled sparse models, such as the Lasso, performance is also highly dependent on the degree of sparsity. In distilled models this seems to be much less the case [1, 2].\n", "\n", "Still, there are situations in which one may wish to choose a specific sparsity level, or err on the side of choosing a higher/lower level of sparsity." ] }, { "cell_type": "markdown", "id": "23bb301e", "metadata": {}, "source": [ "## Baseline" ] }, { "cell_type": "code", "execution_count": 3, "id": "2ab46c74", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "from sparsesurv.utils import transform_survival\n", "from sklearn.decomposition import PCA\n", "from sparsesurv._base import KDSurv\n", "from sparsesurv.cv import KDPHElasticNetCV, KDEHMultiTaskLassoCV, KDAFTElasticNetCV\n", "from sparsesurv.utils import transform_survival\n", "from sklearn.pipeline import make_pipeline\n", "from sksurv.linear_model import CoxPHSurvivalAnalysis\n", "from sparsesurv.aft import AFT\n", "from sparsesurv.eh import EH\n", "from sklearn.preprocessing import StandardScaler\n", "\n", "df = pd.read_csv(\"https://zenodo.org/records/10027434/files/OV_data_preprocessed.csv?download=1\")\n", "X = df.iloc[:, 3:].to_numpy()\n", "y = transform_survival(time=df.OS_days.values, event=df.OS.values)\n", "\n", "X_train = X[:200]\n", "X_test = X[200:]\n", "y_train = y[:200]\n", "y_test = y[200:]\n", "\n", "pipe_cox_efron = KDSurv(\n", " teacher=make_pipeline(\n", " StandardScaler(),\n", " PCA(n_components=16),\n", " CoxPHSurvivalAnalysis(ties=\"efron\"),\n", " ),\n", " student=make_pipeline(\n", " StandardScaler(),\n", " KDPHElasticNetCV(\n", " tie_correction=\"efron\",\n", " l1_ratio=0.9,\n", " eps=0.01,\n", " n_alphas=100,\n", " cv=5,\n", " stratify_cv=True,\n", " seed=None,\n", " shuffle_cv=False,\n", " cv_score_method=\"linear_predictor\",\n", " n_jobs=1,\n", " alpha_type=\"min\",\n", " ),\n", " ),\n", " )\n", "\n", "pipe_cox_efron.fit(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": 4, "id": "5ec0004f", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "169" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import numpy as np\n", "np.sum(pipe_cox_efron.student[1].coef_ != 0.0)" ] }, { "cell_type": "code", "execution_count": 5, "id": "6b8da529", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.5244498777506112" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sksurv.metrics import concordance_index_censored\n", "concordance_index_censored(y_test[\"event\"], y_test[\"time\"], pipe_cox_efron.predict(X_test))[0]" ] }, { "cell_type": "markdown", "id": "e5c5a5c6", "metadata": {}, "source": [ "By default, `sparsesurv` fits models with no limit on the number of non-zero coefficients, beyond what is implied by the regularizers. In addition, `sparsesurv` uses `alpha_type=\"min\"` by default, thus choosing the regularization hyperparameter which maximizes the score, which will often be rather high (as evidenced by 168 non-zero coefficients in the ovarian cancer example above).\n" ] }, { "cell_type": "markdown", "id": "1b4cfde7", "metadata": {}, "source": [ "## Explicitly limiting the number of non-zero coefficients" ] }, { "cell_type": "markdown", "id": "c38879c0", "metadata": {}, "source": [ "One alternative is explicitly limiting the number of non-zero coefficients. Below, we select the regularization hyperparameter with the maximum that score that has 50 non-zero coefficients or less." ] }, { "cell_type": "code", "execution_count": 6, "id": "cfffc0bd", "metadata": {}, "outputs": [], "source": [ "pipe_cox_efron = KDSurv(\n", " teacher=make_pipeline(\n", " StandardScaler(),\n", " PCA(n_components=16),\n", " CoxPHSurvivalAnalysis(ties=\"efron\"),\n", " ),\n", " student=make_pipeline(\n", " StandardScaler(),\n", " KDPHElasticNetCV(\n", " tie_correction=\"efron\",\n", " l1_ratio=0.9,\n", " eps=0.01,\n", " n_alphas=100,\n", " cv=5,\n", " stratify_cv=True,\n", " seed=None,\n", " shuffle_cv=False,\n", " cv_score_method=\"linear_predictor\",\n", " n_jobs=1,\n", " alpha_type=\"min\",\n", " max_coef=50\n", " ),\n", " ),\n", " )\n", "\n", "pipe_cox_efron.fit(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": 7, "id": "b0b9c18b", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "47" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.sum(pipe_cox_efron.student[1].coef_ != 0.0)" ] }, { "cell_type": "code", "execution_count": 8, "id": "03ce2c44", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.5138549307253464" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "concordance_index_censored(y_test[\"event\"], y_test[\"time\"], pipe_cox_efron.predict(X_test))[0]" ] }, { "cell_type": "markdown", "id": "645855c3", "metadata": {}, "source": [ "While explicitly setting the desired degree of sparsity can work well, one may also want a degree of sparsity to be chosen that finds a good trade-off between performance and sparsity." ] }, { "cell_type": "markdown", "id": "110aad98", "metadata": {}, "source": [ "## Automatically trading-off between sparsity and performance" ] }, { "cell_type": "markdown", "id": "8dc0a54a", "metadata": {}, "source": [ "For this purpose, `sparsesurv` implements two alternative rules, instead of choosing the regularization hyperparameter that maximizes the score:\n", "\n", " 1. alpha_type=\"1se\" chooses the highest regularization hyperparameter that is within one standard error of the \n", " mean of the best score [3].\n", " \n", " 2. alpha_type=\"pcvl\" chooses a regularization hyperparameter less sparse than \"1se\" but more sparse than \"min\"\n", " via a penalization approach [4].\n", " \n", "Importantly, `alpha_type=1se` requires `cv_score_method != linear_predictor`, since otherwise calculating a mean score is impossible." ] }, { "cell_type": "code", "execution_count": 13, "id": "ef8ee7a8", "metadata": {}, "outputs": [], "source": [ "pipe_cox_efron = KDSurv(\n", " teacher=make_pipeline(\n", " StandardScaler(),\n", " PCA(n_components=16),\n", " CoxPHSurvivalAnalysis(ties=\"efron\"),\n", " ),\n", " student=make_pipeline(\n", " StandardScaler(),\n", " KDPHElasticNetCV(\n", " tie_correction=\"efron\",\n", " l1_ratio=0.9,\n", " eps=0.01,\n", " n_alphas=100,\n", " cv=5,\n", " stratify_cv=True,\n", " seed=None,\n", " shuffle_cv=False,\n", " cv_score_method=\"vvh\",\n", " n_jobs=1,\n", " alpha_type=\"1se\"\n", " ),\n", " ),\n", " )\n", "\n", "pipe_cox_efron.fit(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": 14, "id": "ffc98f9d", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.sum(pipe_cox_efron.student[1].coef_ != 0.0)" ] }, { "cell_type": "code", "execution_count": 15, "id": "11d2b9de", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.5" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "concordance_index_censored(y_test[\"event\"], y_test[\"time\"], pipe_cox_efron.predict(X_test))[0]" ] }, { "cell_type": "code", "execution_count": 16, "id": "cfa745c1", "metadata": {}, "outputs": [], "source": [ "pipe_cox_efron = KDSurv(\n", " teacher=make_pipeline(\n", " StandardScaler(),\n", " PCA(n_components=16),\n", " CoxPHSurvivalAnalysis(ties=\"efron\"),\n", " ),\n", " student=make_pipeline(\n", " StandardScaler(),\n", " KDPHElasticNetCV(\n", " tie_correction=\"efron\",\n", " l1_ratio=0.9,\n", " eps=0.01,\n", " n_alphas=100,\n", " cv=5,\n", " stratify_cv=True,\n", " seed=None,\n", " shuffle_cv=False,\n", " cv_score_method=\"linear_predictor\",\n", " n_jobs=1,\n", " alpha_type=\"pcvl\"\n", " ),\n", " ),\n", " )\n", "\n", "pipe_cox_efron.fit(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": 17, "id": "2b124255", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "51" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.sum(pipe_cox_efron.student[1].coef_ != 0.0)" ] }, { "cell_type": "code", "execution_count": 18, "id": "c2fb2b92", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.5142624286878565" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "concordance_index_censored(y_test[\"event\"], y_test[\"time\"], pipe_cox_efron.predict(X_test))[0]" ] }, { "cell_type": "markdown", "id": "86fd7829", "metadata": {}, "source": [ "As seen above, the downside of automatic selection methods, is that they may select completely sparse models if the prediction is not much better than chance, as is the case here." ] }, { "cell_type": "markdown", "id": "039220e6", "metadata": {}, "source": [ "## References" ] }, { "cell_type": "markdown", "id": "a1f18949", "metadata": {}, "source": [ "[1] David Wissel, Nikita Janakarajan, Daniel Rowson, Julius Schulte, Xintian Yuan, Valentina Boeva. \"sparsesurv: Sparse survival models via knowledge distillation.\" (2023, under review).\n", "\n", "[2] Paul, Debashis, et al. \"“Preconditioning” for feature selection and regression in high-dimensional problems.\" (2008): 1595-1618.\n", "\n", "\n", "[3] Hastie, Trevor, et al. The elements of statistical learning: data mining, inference, and prediction. Vol. 2. New York: Springer, 2009.\n", "\n", "[4] Ternès, Nils, Federico Rotolo, and Stefan Michiels. \"Empirical extensions of the lasso penalty to reduce the false discovery rate in high‐dimensional Cox regression models.\" Statistics in medicine 35.15 (2016): 2561-2573." ] }, { "cell_type": "markdown", "id": "9a675767", "metadata": {}, "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.0" } }, "nbformat": 4, "nbformat_minor": 5 }