{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "b3cb023c",
   "metadata": {},
   "source": [
    "## Survival analysis"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "aedae0d7",
   "metadata": {},
   "source": [
    "sparsesurv [1] operates on survival analysis data. Below, we quote the notation from the supplementary section of our manuscript to ensure we are on the same page in terms of notation and language.\n",
    "\n",
    "> In particular, survival analysis concerns the analysis and modeling of a non-negative random variable $T > 0$ that is used to model the time until an event of interest occurs. In observational survival datasets, we let $T_i$ and $C_i$ denote the event and right-censoring times of patient $i$. In right-censored survival analysis, we observe triplets $(x_i, \\delta_i, O_i)$, where $O_i = \\min(T_i, C_i)$ and $\\delta_i = \\mathbf{1}(T_i \\leq C_i)$. Throughout, we assume conditionally independent censoring and non-informative censoring. That is, $T \\perp\\!\\!\\!\\!\\perp C \\mid X$ and $C$ may not be a function of any of the parameters of $T$ (Kalbfleisch and Prentice, 2011). Further, let $\\lambda$ denote the hazard function, $\\Lambda$ be the cumulative hazard function, and $S(t) = 1 - F(t)$ be the survival function, where $F(t)$ denotes the cumulative distribution function. We let $\\tilde T$ be the set of unique, ascending-ordered death times. $R(i)$ is the risk set at time $i$, that is, $R(i) = \\{j: O_j \\geq O_i\\}$. $D(i)$ denotes the death set at time $i$, that is, $D(i) = \\{j: O_j = i \\land \\delta_j = 1\\}$.\n",
    "\n",
    "For now, *sparsesurv* operates solely on right-censored data, although we may consider an extension to other censoring and truncation schemes if there is interest. We now briefly show an example right-censored survival dataset available in *scikit-survival* [4], another Python package for survival analysis."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "abecf90f",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sksurv.datasets import load_flchain\n",
    "X, y = load_flchain()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "e92dd399",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>age</th>\n",
       "      <th>chapter</th>\n",
       "      <th>creatinine</th>\n",
       "      <th>flc.grp</th>\n",
       "      <th>kappa</th>\n",
       "      <th>lambda</th>\n",
       "      <th>mgus</th>\n",
       "      <th>sample.yr</th>\n",
       "      <th>sex</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>97.0</td>\n",
       "      <td>Circulatory</td>\n",
       "      <td>1.7</td>\n",
       "      <td>10</td>\n",
       "      <td>5.700</td>\n",
       "      <td>4.860</td>\n",
       "      <td>no</td>\n",
       "      <td>1997</td>\n",
       "      <td>F</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>92.0</td>\n",
       "      <td>Neoplasms</td>\n",
       "      <td>0.9</td>\n",
       "      <td>1</td>\n",
       "      <td>0.870</td>\n",
       "      <td>0.683</td>\n",
       "      <td>no</td>\n",
       "      <td>2000</td>\n",
       "      <td>F</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>94.0</td>\n",
       "      <td>Circulatory</td>\n",
       "      <td>1.4</td>\n",
       "      <td>10</td>\n",
       "      <td>4.360</td>\n",
       "      <td>3.850</td>\n",
       "      <td>no</td>\n",
       "      <td>1997</td>\n",
       "      <td>F</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>92.0</td>\n",
       "      <td>Circulatory</td>\n",
       "      <td>1.0</td>\n",
       "      <td>9</td>\n",
       "      <td>2.420</td>\n",
       "      <td>2.220</td>\n",
       "      <td>no</td>\n",
       "      <td>1996</td>\n",
       "      <td>F</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>93.0</td>\n",
       "      <td>Circulatory</td>\n",
       "      <td>1.1</td>\n",
       "      <td>6</td>\n",
       "      <td>1.320</td>\n",
       "      <td>1.690</td>\n",
       "      <td>no</td>\n",
       "      <td>1996</td>\n",
       "      <td>F</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7869</th>\n",
       "      <td>52.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1.0</td>\n",
       "      <td>6</td>\n",
       "      <td>1.210</td>\n",
       "      <td>1.610</td>\n",
       "      <td>no</td>\n",
       "      <td>1995</td>\n",
       "      <td>F</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7870</th>\n",
       "      <td>52.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.8</td>\n",
       "      <td>1</td>\n",
       "      <td>0.858</td>\n",
       "      <td>0.581</td>\n",
       "      <td>no</td>\n",
       "      <td>1999</td>\n",
       "      <td>F</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7871</th>\n",
       "      <td>54.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>8</td>\n",
       "      <td>1.700</td>\n",
       "      <td>1.720</td>\n",
       "      <td>no</td>\n",
       "      <td>2002</td>\n",
       "      <td>F</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7872</th>\n",
       "      <td>53.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>9</td>\n",
       "      <td>1.710</td>\n",
       "      <td>2.690</td>\n",
       "      <td>no</td>\n",
       "      <td>1995</td>\n",
       "      <td>F</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7873</th>\n",
       "      <td>50.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.7</td>\n",
       "      <td>4</td>\n",
       "      <td>1.190</td>\n",
       "      <td>1.250</td>\n",
       "      <td>no</td>\n",
       "      <td>1998</td>\n",
       "      <td>F</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>7874 rows × 9 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "       age      chapter  creatinine flc.grp  kappa  lambda mgus sample.yr sex\n",
       "0     97.0  Circulatory         1.7      10  5.700   4.860   no      1997   F\n",
       "1     92.0    Neoplasms         0.9       1  0.870   0.683   no      2000   F\n",
       "2     94.0  Circulatory         1.4      10  4.360   3.850   no      1997   F\n",
       "3     92.0  Circulatory         1.0       9  2.420   2.220   no      1996   F\n",
       "4     93.0  Circulatory         1.1       6  1.320   1.690   no      1996   F\n",
       "...    ...          ...         ...     ...    ...     ...  ...       ...  ..\n",
       "7869  52.0          NaN         1.0       6  1.210   1.610   no      1995   F\n",
       "7870  52.0          NaN         0.8       1  0.858   0.581   no      1999   F\n",
       "7871  54.0          NaN         NaN       8  1.700   1.720   no      2002   F\n",
       "7872  53.0          NaN         NaN       9  1.710   2.690   no      1995   F\n",
       "7873  50.0          NaN         0.7       4  1.190   1.250   no      1998   F\n",
       "\n",
       "[7874 rows x 9 columns]"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4fb9c403",
   "metadata": {},
   "source": [
    "The design matrix $X$ looks the same as it would in other modeling settings, such as regression or classification, and thus does not require any special treatment from a modeling point of view."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "e02a6556",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([( True,   85.), ( True, 1281.), ( True,   69.), ...,\n",
       "       (False, 2507.), (False, 4982.), (False, 3995.)],\n",
       "      dtype=[('death', '?'), ('futime', '<f8')])"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "y"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0ccb8bca",
   "metadata": {},
   "source": [
    "At first glance, the target $y$ looks unusual, as it contains two elements for each sample. These correspond exactly to $\\delta_i$ and $O_i$ in the notation above: the event indicator (`death`, `True` if the event was observed and `False` if the sample was censored) and the observed time (`futime`). Right-censored survival data is generally represented as a structured array, as shown here, with one field for the event indicator and one for the observed time:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "1f81ab66",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([ True,  True,  True, ..., False, False, False])"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "y[\"death\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "1cfd4af6",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([1, 1, 1, ..., 0, 0, 0])"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "(y[\"death\"]).astype(int)"
   ]
  },
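  {
   "cell_type": "markdown",
   "id": "c41d7e90",
   "metadata": {},
   "source": [
    "To make the notation concrete, the following minimal sketch builds $O_i$ and $\\delta_i$ from hypothetical latent event and censoring times and packs them into a structured array with the same layout as `y` above. All values and variable names are made up for illustration. In real data, only $O_i$ and $\\delta_i$ are observed, never both $T_i$ and $C_i$."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9be2a6f3",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "# Hypothetical latent event times T_i and censoring times C_i (toy values).\n",
    "event_time = np.array([5.0, 8.0, 3.0, 10.0])\n",
    "censor_time = np.array([7.0, 6.0, 9.0, 10.0])\n",
    "\n",
    "# O_i = min(T_i, C_i) and delta_i = 1(T_i <= C_i).\n",
    "observed_time = np.minimum(event_time, censor_time)\n",
    "event_observed = event_time <= censor_time\n",
    "\n",
    "# Same layout as the scikit-survival target: (event indicator, observed time).\n",
    "y_toy = np.empty(len(observed_time), dtype=[(\"death\", \"?\"), (\"futime\", \"<f8\")])\n",
    "y_toy[\"death\"] = event_observed\n",
    "y_toy[\"futime\"] = observed_time\n",
    "\n",
    "# Risk set of sample i: indices of all samples j with O_j >= O_i.\n",
    "risk_sets = [np.flatnonzero(observed_time >= t) for t in observed_time]\n",
    "y_toy"
   ]
  },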
  {
   "cell_type": "markdown",
   "id": "03f839f5",
   "metadata": {},
   "source": [
    "While the number of samples relative to the number of covariates is quite favorable here (only 9 variables for 7,874 samples), a very common setting in (cancer) survival analysis is one where the number of available covariates is much larger than the number of available samples (i.e., $p \\gg n$). This is exactly the setting that *sparsesurv* is designed for. *sparsesurv* is based on knowledge distillation [5], which is also referred to as preconditioning [2] or reference models in statistics [3]. Thus, we briefly introduce the idea of knowledge distillation next."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "a680cb51",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(7874, 9)"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "65210e89",
   "metadata": {},
   "source": [
    "## Knowledge distillation"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6493105e",
   "metadata": {},
   "source": [
    "The original idea of knowledge distillation was not directly related to interpretability or feature selection. We note, however, that something akin to knowledge distillation was proposed and used in statistics before knowledge distillation itself, under the name of preconditioning [2]. We will continue to use the name knowledge distillation since it may be more familiar to readers in the machine learning community."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "720f749a",
   "metadata": {},
   "source": [
    "The actual process of knowledge distillation proceeds in two steps:\n",
    "\n",
    "1. Fit a teacher model that can approximate the target of interest (very) well.\n",
    "2. Fit a student model on the predictions of the teacher model, hoping that the teacher model acts as a kind of noise filter."
   ]
  },
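  {
   "cell_type": "markdown",
   "id": "3f7a1c58",
   "metadata": {},
   "source": [
    "The two steps above can be sketched on a plain regression problem before turning to survival data. In this toy sketch, all data are simulated, and the k-nearest-neighbour teacher and least-squares student are illustrative assumptions rather than the models used by *sparsesurv*. The key point is that the student is fit on the teacher's predictions instead of the noisy target."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e85b2d04",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "# Simulated regression data (illustration only).\n",
    "rng = np.random.default_rng(0)\n",
    "n, p = 200, 3\n",
    "X_sim = rng.normal(size=(n, p))\n",
    "beta = np.array([1.0, -2.0, 0.0])\n",
    "y_sim = X_sim @ beta + rng.normal(scale=1.0, size=n)\n",
    "\n",
    "# Step 1: teacher. A k-nearest-neighbour smoother stands in for a flexible model.\n",
    "k = 10\n",
    "dist = np.linalg.norm(X_sim[:, None, :] - X_sim[None, :, :], axis=2)\n",
    "neighbours = np.argsort(dist, axis=1)[:, :k]\n",
    "teacher_pred = y_sim[neighbours].mean(axis=1)\n",
    "\n",
    "# Step 2: student. Ordinary least squares fit on the teacher predictions.\n",
    "X_design = np.column_stack([np.ones(n), X_sim])\n",
    "student_coef, *_ = np.linalg.lstsq(X_design, teacher_pred, rcond=None)\n",
    "student_pred = X_design @ student_coef\n",
    "student_coef"
   ]
  },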
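  {
   "cell_type": "markdown",
   "id": "22ee2d78",
   "metadata": {},
   "source": [
    "We now illustrate how this process can be adapted to survival analysis. Note that this running example continues to use the `flchain` data from *scikit-survival* before we move on to high-dimensional data. The teacher is a random survival forest, the student is a linear regression fit on the teacher's predictions, and a Cox proportional hazards model fit directly on the survival data serves as a baseline."
   ]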
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "f2e5b94e",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<style>#sk-container-id-1 {color: black;background-color: white;}#sk-container-id-1 pre{padding: 0;}#sk-container-id-1 div.sk-toggleable {background-color: white;}#sk-container-id-1 label.sk-toggleable__label {cursor: pointer;display: block;width: 100%;margin-bottom: 0;padding: 0.3em;box-sizing: border-box;text-align: center;}#sk-container-id-1 label.sk-toggleable__label-arrow:before {content: \"▸\";float: left;margin-right: 0.25em;color: #696969;}#sk-container-id-1 label.sk-toggleable__label-arrow:hover:before {color: black;}#sk-container-id-1 div.sk-estimator:hover label.sk-toggleable__label-arrow:before {color: black;}#sk-container-id-1 div.sk-toggleable__content {max-height: 0;max-width: 0;overflow: hidden;text-align: left;background-color: #f0f8ff;}#sk-container-id-1 div.sk-toggleable__content pre {margin: 0.2em;color: black;border-radius: 0.25em;background-color: #f0f8ff;}#sk-container-id-1 input.sk-toggleable__control:checked~div.sk-toggleable__content {max-height: 200px;max-width: 100%;overflow: auto;}#sk-container-id-1 input.sk-toggleable__control:checked~label.sk-toggleable__label-arrow:before {content: \"▾\";}#sk-container-id-1 div.sk-estimator input.sk-toggleable__control:checked~label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-1 div.sk-label input.sk-toggleable__control:checked~label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-1 input.sk-hidden--visually {border: 0;clip: rect(1px 1px 1px 1px);clip: rect(1px, 1px, 1px, 1px);height: 1px;margin: -1px;overflow: hidden;padding: 0;position: absolute;width: 1px;}#sk-container-id-1 div.sk-estimator {font-family: monospace;background-color: #f0f8ff;border: 1px dotted black;border-radius: 0.25em;box-sizing: border-box;margin-bottom: 0.5em;}#sk-container-id-1 div.sk-estimator:hover {background-color: #d4ebff;}#sk-container-id-1 div.sk-parallel-item::after {content: \"\";width: 100%;border-bottom: 1px solid gray;flex-grow: 1;}#sk-container-id-1 
div.sk-label:hover label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-1 div.sk-serial::before {content: \"\";position: absolute;border-left: 1px solid gray;box-sizing: border-box;top: 0;bottom: 0;left: 50%;z-index: 0;}#sk-container-id-1 div.sk-serial {display: flex;flex-direction: column;align-items: center;background-color: white;padding-right: 0.2em;padding-left: 0.2em;position: relative;}#sk-container-id-1 div.sk-item {position: relative;z-index: 1;}#sk-container-id-1 div.sk-parallel {display: flex;align-items: stretch;justify-content: center;background-color: white;position: relative;}#sk-container-id-1 div.sk-item::before, #sk-container-id-1 div.sk-parallel-item::before {content: \"\";position: absolute;border-left: 1px solid gray;box-sizing: border-box;top: 0;bottom: 0;left: 50%;z-index: -1;}#sk-container-id-1 div.sk-parallel-item {display: flex;flex-direction: column;z-index: 1;position: relative;background-color: white;}#sk-container-id-1 div.sk-parallel-item:first-child::after {align-self: flex-end;width: 50%;}#sk-container-id-1 div.sk-parallel-item:last-child::after {align-self: flex-start;width: 50%;}#sk-container-id-1 div.sk-parallel-item:only-child::after {width: 0;}#sk-container-id-1 div.sk-dashed-wrapped {border: 1px dashed gray;margin: 0 0.4em 0.5em 0.4em;box-sizing: border-box;padding-bottom: 0.4em;background-color: white;}#sk-container-id-1 div.sk-label label {font-family: monospace;font-weight: bold;display: inline-block;line-height: 1.2em;}#sk-container-id-1 div.sk-label-container {text-align: center;}#sk-container-id-1 div.sk-container {/* jupyter's `normalize.less` sets `[hidden] { display: none; }` but bootstrap.min.css set `[hidden] { display: none !important; }` so we also need the `!important` here to be able to override the default hidden behavior on the sphinx rendered scikit-learn.org. 
See: https://github.com/scikit-learn/scikit-learn/issues/21755 */display: inline-block !important;position: relative;}#sk-container-id-1 div.sk-text-repr-fallback {display: none;}</style><div id=\"sk-container-id-1\" class=\"sk-top-container\"><div class=\"sk-text-repr-fallback\"><pre>Pipeline(steps=[(&#x27;standardscaler&#x27;, StandardScaler()),\n",
       "                (&#x27;coxphsurvivalanalysis&#x27;, CoxPHSurvivalAnalysis())])</pre><b>In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. <br />On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.</b></div><div class=\"sk-container\" hidden><div class=\"sk-item sk-dashed-wrapped\"><div class=\"sk-label-container\"><div class=\"sk-label sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-1\" type=\"checkbox\" ><label for=\"sk-estimator-id-1\" class=\"sk-toggleable__label sk-toggleable__label-arrow\">Pipeline</label><div class=\"sk-toggleable__content\"><pre>Pipeline(steps=[(&#x27;standardscaler&#x27;, StandardScaler()),\n",
       "                (&#x27;coxphsurvivalanalysis&#x27;, CoxPHSurvivalAnalysis())])</pre></div></div></div><div class=\"sk-serial\"><div class=\"sk-item\"><div class=\"sk-estimator sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-2\" type=\"checkbox\" ><label for=\"sk-estimator-id-2\" class=\"sk-toggleable__label sk-toggleable__label-arrow\">StandardScaler</label><div class=\"sk-toggleable__content\"><pre>StandardScaler()</pre></div></div></div><div class=\"sk-item\"><div class=\"sk-estimator sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-3\" type=\"checkbox\" ><label for=\"sk-estimator-id-3\" class=\"sk-toggleable__label sk-toggleable__label-arrow\">CoxPHSurvivalAnalysis</label><div class=\"sk-toggleable__content\"><pre>CoxPHSurvivalAnalysis()</pre></div></div></div></div></div></div></div>"
      ],
      "text/plain": [
       "Pipeline(steps=[('standardscaler', StandardScaler()),\n",
       "                ('coxphsurvivalanalysis', CoxPHSurvivalAnalysis())])"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sksurv.ensemble import RandomSurvivalForest\n",
    "from sksurv.linear_model import CoxPHSurvivalAnalysis\n",
    "from sklearn.linear_model import LinearRegression\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "from sklearn.pipeline import make_pipeline\n",
    "from sksurv.metrics import concordance_index_censored\n",
    "\n",
    "X_train = X.iloc[:5000, [0, 3, 4]]\n",
    "X_test = X.iloc[5000:, [0, 3, 4]]\n",
    "y_train = y[:5000]\n",
    "y_test = y[5000:]\n",
    "\n",
    "teacher_pipe = make_pipeline(StandardScaler(), RandomSurvivalForest())\n",
    "student_pipe = make_pipeline(StandardScaler(), LinearRegression())\n",
    "baseline_pipe = make_pipeline(StandardScaler(), CoxPHSurvivalAnalysis())\n",
    "\n",
    "teacher_pipe.fit(X_train, y_train)\n",
    "student_pipe.fit(X_train, teacher_pipe.predict(X_train))\n",
    "baseline_pipe.fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "63f20333",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.5677982023489915"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "concordance_index_censored(y_test[\"death\"], y_test[\"futime\"], teacher_pipe.predict(X_test))[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "9d17ab57",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.6479734724443875"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "concordance_index_censored(y_test[\"death\"], y_test[\"futime\"], student_pipe.predict(X_test))[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "63bd79fd",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.6444986225266496"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "concordance_index_censored(y_test[\"death\"], y_test[\"futime\"], baseline_pipe.predict(X_test))[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e8bfd176",
   "metadata": {},
   "source": [
    "Interestingly, in this example, the student's performance (as measured by Harrell's concordance, 0.648) was slightly higher than that of the baseline (0.644), even though the teacher performed considerably worse (0.568). Note that the random survival forest was fit without a fixed seed, so the exact values may vary between runs. Effects of this kind are the subject of ongoing research in the ML community. For us, however, all that matters is that knowledge distillation works for survival analysis."
   ]
  },
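  {
   "cell_type": "markdown",
   "id": "a06c9d2b",
   "metadata": {},
   "source": [
    "For reference, Harrell's concordance can be sketched in a few lines. The helper below (`harrell_c`, a hypothetical name) is a simplified illustration that skips pairs with tied observed times, so it is not a drop-in replacement for `concordance_index_censored`. A pair is comparable when the sample with the shorter observed time had an event, and it is concordant when that sample also has the higher predicted risk."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6d18f4c7",
   "metadata": {},
   "outputs": [],
   "source": [
    "def harrell_c(event, time, risk):\n",
    "    \"\"\"Simplified Harrell's C; pairs with tied observed times are skipped.\"\"\"\n",
    "    concordant = 0.0\n",
    "    comparable = 0\n",
    "    for i in range(len(time)):\n",
    "        for j in range(len(time)):\n",
    "            # Comparable pair: i has the shorter time and experienced the event.\n",
    "            if time[i] < time[j] and event[i]:\n",
    "                comparable += 1\n",
    "                if risk[i] > risk[j]:\n",
    "                    concordant += 1.0\n",
    "                elif risk[i] == risk[j]:\n",
    "                    concordant += 0.5  # ties in predicted risk count as half\n",
    "    if comparable == 0:\n",
    "        raise ValueError(\"no comparable pairs\")\n",
    "    return concordant / comparable\n",
    "\n",
    "\n",
    "# Toy data (made up): higher risk should go with shorter time.\n",
    "toy_event = [True, True, False, True]\n",
    "toy_time = [2.0, 4.0, 5.0, 7.0]\n",
    "toy_risk = [0.9, 0.3, 0.5, 0.1]\n",
    "harrell_c(toy_event, toy_time, toy_risk)"
   ]
  },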
  {
   "cell_type": "markdown",
   "id": "a83a03bd",
   "metadata": {},
   "source": [
    "## Minimal example of *sparsesurv*"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8a0fe82c",
   "metadata": {},
   "source": [
    "Lastly, we give a brief usage example of *sparsesurv*. If you are interested in using *sparsesurv* on your own data, please consult the documentation or the more specific user guides linked above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "2567a022",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "from sparsesurv.utils import transform_survival\n",
    "df = pd.read_csv(\"https://zenodo.org/records/10027434/files/OV_data_preprocessed.csv?download=1\")\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "c62c8211",
   "metadata": {},
   "outputs": [],
   "source": [
    "X = df.iloc[:, 3:].to_numpy()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "58b8ff48",
   "metadata": {},
   "outputs": [],
   "source": [
    "y = transform_survival(time=df.OS_days.values, event=df.OS.values)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "a87a3470",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(302, 19076)"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "3aa29fb9",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([( True,  304.), ( True,   24.), (False,  576.), (False, 1207.),\n",
       "       ( True,  676.)], dtype=[('event', '?'), ('time', '<f8')])"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "y[: 5]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "e0fff1d3",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.decomposition import PCA\n",
    "from sparsesurv._base import KDSurv\n",
    "from sparsesurv.cv import KDPHElasticNetCV\n",
    "\n",
    "# Teacher: Cox PH model (Efron ties) fit on the first 16 principal components.\n",
    "# Student: elastic net (l1_ratio=0.9) tuned by 5-fold cross-validation.\n",
    "pipe = KDSurv(\n",
    "    teacher=make_pipeline(\n",
    "        StandardScaler(),\n",
    "        PCA(n_components=16),\n",
    "        CoxPHSurvivalAnalysis(ties=\"efron\"),\n",
    "    ),\n",
    "    student=make_pipeline(\n",
    "        StandardScaler(),\n",
    "        KDPHElasticNetCV(\n",
    "            tie_correction=\"efron\",\n",
    "            l1_ratio=0.9,\n",
    "            eps=0.01,\n",
    "            n_alphas=100,\n",
    "            cv=5,\n",
    "            stratify_cv=True,\n",
    "            seed=None,\n",
    "            shuffle_cv=False,\n",
    "            cv_score_method=\"linear_predictor\",\n",
    "            n_jobs=1,\n",
    "            alpha_type=\"min\",\n",
    "        ),\n",
    "    ),\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "id": "b71b2ea4",
   "metadata": {},
   "outputs": [],
   "source": [
    "pipe.fit(X, y)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "32434a86",
   "metadata": {},
   "source": [
    "Now, we can easily check how many non-zero coefficients the fitted model has, or get predictions on the training set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "id": "c9c2a171",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "260"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import numpy as np\n",
    "np.sum(pipe.student[1].coef_ != 0.0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "id": "1b19a484",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([ 2.82819899e-01,  4.66079463e-01,  1.15606183e-01,  5.02952325e-01,\n",
       "        4.17393777e-02, -4.11573164e-01,  7.60555538e-01,  4.34577104e-01,\n",
       "        1.14479065e-01,  3.75259090e-02, -1.34671115e-02,  1.61175017e-01,\n",
       "        1.00036148e-01,  2.40373951e-01, -3.25340021e-01, -8.67875954e-01,\n",
       "        7.96211926e-01, -1.83116380e-01,  1.07281533e-02, -4.77575349e-02,\n",
       "       -1.43752928e-01,  4.76747822e-01, -1.28454611e-02, -2.77780752e-01,\n",
       "        1.18303704e-01,  7.13485684e-01, -9.06133817e-02, -2.28805873e-01,\n",
       "       -1.85240257e-01, -1.88895164e-01,  1.79877864e-02, -4.40709840e-01,\n",
       "       -3.33191668e-02, -3.47169497e-01, -5.41839603e-02, -7.00115696e-01,\n",
       "       -4.83739189e-01,  1.49624486e-01,  2.72062364e-01,  7.45285449e-01,\n",
       "       -7.67869829e-02, -1.82008113e-01, -6.69662687e-02,  8.93221182e-02,\n",
       "        2.10881649e-01, -5.40481656e-01, -6.76554743e-02,  4.59392323e-02,\n",
       "       -1.14925641e-01,  1.08297960e-01, -1.48445361e-01,  3.92129238e-01,\n",
       "        3.46609042e-04,  1.68584073e-01,  1.64706486e-01,  2.00876547e-01,\n",
       "       -4.59086265e-01, -8.81797220e-02, -3.05503196e-01, -1.06486223e+00,\n",
       "       -8.44201513e-01,  3.24547683e-01, -1.86346579e-01, -1.51631340e-01,\n",
       "        1.79803120e-01,  1.06390383e-01, -6.11439908e-01,  6.71797157e-02,\n",
       "       -6.46251752e-01,  3.26988764e-01,  1.07586534e-01,  3.68125671e-02,\n",
       "       -3.73012562e-01,  5.57874420e-02,  3.79215839e-01, -2.26237176e-01,\n",
       "       -1.29248351e-02,  1.22046107e-01,  3.74760349e-01,  5.23166206e-01,\n",
       "        3.03795116e-01, -7.36812630e-01,  3.94627049e-01,  4.25870853e-02,\n",
       "        2.98564610e-02,  2.95674650e-01, -2.19203239e-01, -5.11737562e-01,\n",
       "       -5.24482896e-02, -4.64111839e-03, -3.21799909e-01, -7.14669688e-01,\n",
       "        5.84417412e-01, -2.90476660e-01,  1.78423635e-01, -2.39300019e-01,\n",
       "        1.80786722e-01, -3.34470179e-02, -2.56003735e-01, -5.37086393e-02,\n",
       "       -4.43376994e-01,  3.19931551e-01,  1.97498674e-01, -2.61924192e-02,\n",
       "        1.33033208e-01, -6.95678071e-02, -7.24221455e-02,  5.40122505e-01,\n",
       "        2.65075056e-01, -9.89361408e-01,  6.19007678e-02,  2.07408942e-01,\n",
       "        6.82609874e-02, -6.37693151e-01, -3.20701807e-01, -6.57601954e-01,\n",
       "        1.24781707e-01,  1.89385766e-01,  2.85797247e-01, -5.92343166e-02,\n",
       "        3.80218635e-01,  5.44106400e-01,  6.82875840e-01, -3.90579509e-01,\n",
       "       -8.42389835e-02,  7.58434789e-01, -8.61379695e-02,  7.01730243e-01,\n",
       "       -1.48692814e-01,  7.07359015e-02, -6.76859547e-02, -4.46045375e-02,\n",
       "        4.81015185e-01, -1.98355534e-01, -1.59107820e-01, -3.26734493e-01,\n",
       "        4.56446108e-01, -4.22370601e-01,  8.02473240e-01,  1.50128880e-01,\n",
       "        6.83395951e-01,  2.06511496e-01,  2.67441747e-01,  8.38858830e-02,\n",
       "        3.20384092e-01,  6.08116886e-01,  3.70467301e-01, -1.47024656e-02,\n",
       "        2.73821126e-01, -2.22213948e-01,  3.45943407e-01,  2.92928823e-01,\n",
       "       -5.43679120e-01,  1.20502523e-01,  5.61094405e-01, -9.07648816e-02,\n",
       "       -9.08701304e-02,  5.10690412e-01, -1.53761912e-01,  4.23909767e-01,\n",
       "       -4.37683251e-01,  3.16901267e-01,  3.95289983e-01, -2.98683737e-01,\n",
       "        2.21080367e-01,  8.72769946e-02,  1.29061883e-01,  1.20706128e-01,\n",
       "       -1.15802828e-01, -7.05581525e-02, -3.29695109e-01,  2.26276519e-01,\n",
       "        8.84738248e-01,  1.01021425e-01, -1.40474023e-01, -2.09845104e-01,\n",
       "       -7.54611778e-02, -5.61162007e-02, -1.97142810e-01,  7.53374053e-02,\n",
       "        5.01884547e-01, -3.95690048e-01, -1.22748153e-01,  4.00627789e-01,\n",
       "       -2.17081618e-01, -2.00329194e-01, -1.83593487e-01, -1.00495352e-01,\n",
       "       -1.27141121e-01,  4.24449987e-02, -9.90337386e-03,  7.94286450e-02,\n",
       "       -5.22270616e-01,  1.33174475e-01,  9.51870865e-02,  5.27839777e-01,\n",
       "        5.82771390e-01,  3.33622667e-01, -5.25593641e-01, -4.71537211e-01,\n",
       "       -4.27592018e-01,  1.73267690e-01,  3.51133339e-01, -1.63715494e-01,\n",
       "        2.90653134e-01, -2.67936877e-01,  1.74063236e-01,  1.31693212e-01,\n",
       "        2.69151375e-01, -1.73910919e-01,  3.56312451e-01, -2.95944572e-01,\n",
       "       -6.76934153e-01,  1.15260613e-01,  7.46456877e-01, -3.96934560e-01,\n",
       "        6.12222822e-01, -2.83372799e-02, -3.51497255e-01,  6.33327038e-01,\n",
       "        2.71835304e-01,  2.94155228e-01,  5.16169771e-02,  1.08403732e-02,\n",
       "        6.67114061e-01, -1.82332792e-01, -1.21011324e-01, -4.54893674e-01,\n",
       "       -7.96243813e-01,  3.81974552e-01, -1.07648224e-01, -1.71062783e-02,\n",
       "        2.23721369e-01,  1.36297632e-01,  3.11243091e-01, -7.24278109e-03,\n",
       "        6.13225615e-02, -3.55728676e-01,  1.71853952e-01,  8.08492454e-01,\n",
       "        1.08213626e-01,  5.32913239e-02, -9.11289836e-02, -7.65856673e-02,\n",
       "        1.58455383e-01, -5.23370219e-01, -3.94924071e-01, -1.09364829e-01,\n",
       "       -3.78914862e-01, -1.89939895e-01, -1.36739936e-01, -2.63573754e-01,\n",
       "        1.01121780e-01,  7.21476287e-02, -9.03630706e-02,  3.04653321e-01,\n",
       "        1.10769228e-01, -6.11767145e-01,  2.23789775e-01,  1.68579657e-01,\n",
       "        5.42411777e-01,  4.75852309e-01,  5.58213127e-01, -4.18131792e-02,\n",
       "       -4.55969471e-01, -5.50211628e-01, -6.02525807e-01, -1.94368141e-01,\n",
       "       -3.66442961e-01, -1.26307742e-01,  1.02601869e-01, -5.58109635e-01,\n",
       "       -2.63724315e-01, -4.75675963e-01,  2.74982443e-01, -4.08772709e-01,\n",
       "       -7.00331777e-02, -4.36649556e-01, -6.36961440e-02, -3.30293833e-01,\n",
       "       -2.75192094e-01, -3.93795722e-01, -3.22032629e-01,  4.70539660e-01,\n",
       "        2.46172417e-01, -1.01038986e-01, -3.22538044e-01, -4.07975457e-01,\n",
       "        7.72229576e-02, -2.91750927e-01, -2.78907089e-01, -7.91286409e-02,\n",
       "       -1.04626150e-01,  3.41475254e-01, -1.85888436e-01, -2.36925667e-01,\n",
       "       -1.33571157e-01, -1.26144809e-01, -4.56658067e-01,  4.44035026e-01,\n",
       "        1.60457468e-02, -2.75158258e-01])"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pipe.predict(X)"
   ]
  },
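  {
   "cell_type": "markdown",
   "id": "b7e3905d",
   "metadata": {},
   "source": [
    "Beyond counting non-zero coefficients, one typically wants to know which features were selected. The sketch below uses a small made-up coefficient vector and hypothetical feature names to show the pattern. With the fitted model above, the same indexing would presumably apply to `pipe.student[1].coef_` together with the column names `df.columns[3:]`, although this is an assumption and not checked here."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2ac5f681",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "# Made-up sparse coefficients and hypothetical feature names (illustration only).\n",
    "toy_coef = np.array([0.0, 0.42, 0.0, -0.17, 0.0])\n",
    "toy_names = np.array([\"GENE_A\", \"GENE_B\", \"GENE_C\", \"GENE_D\", \"GENE_E\"])\n",
    "\n",
    "# Indices of the selected (non-zero) features, ordered by absolute coefficient size.\n",
    "selected = np.flatnonzero(np.ravel(toy_coef))\n",
    "order = selected[np.argsort(-np.abs(toy_coef[selected]))]\n",
    "list(zip(toy_names[order], toy_coef[order]))"
   ]
  },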
  {
   "cell_type": "markdown",
   "id": "51eb37e9",
   "metadata": {},
   "source": [
    "## References"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1cbba4ed",
   "metadata": {},
   "source": [
    "[1] David Wissel, Nikita Janakarajan, Daniel Rowson, Julius Schulte, Xintian Yuan, Valentina Boeva. \"sparsesurv: Sparse survival models via knowledge distillation.\" (2023, under review).\n",
    "\n",
    "[2] Paul, Debashis, et al. \"“Preconditioning” for feature selection and regression in high-dimensional problems.\" The Annals of Statistics 36.4 (2008): 1595-1618.\n",
    "\n",
    "[3] Pavone, Federico, et al. \"Using reference models in variable selection.\" Computational Statistics 38.1 (2023): 349-371.\n",
    "\n",
    "[4] Pölsterl, Sebastian. \"scikit-survival: A Library for Time-to-Event Analysis Built on Top of scikit-learn.\" The Journal of Machine Learning Research 21.1 (2020): 8747-8752.\n",
    "\n",
    "[5] Hinton, Geoffrey, Oriol Vinyals, and Jeff Dean. \"Distilling the knowledge in a neural network.\" arXiv preprint arXiv:1503.02531 (2015)."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "144df90d",
   "metadata": {},
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}