Survival Analysis of Students Not Graduated on Time Using Cox Proportional Hazard Regression Method and Random Survival Forest Method

Abstract


INTRODUCTION
Higher education is where the nation's next generation is educated, in both academic and non-academic terms. Every university tries to maximize the graduation of its students, both in quantity and quality. The undergraduate program is designed to be completed in 8 semesters; it can also be completed in fewer than 8 semesters, up to a maximum of 14 semesters. Students are said to have graduated if they have completed all courses and academic programs required by their study program. The quality of university graduates can be influenced by several factors, both internal and external. These internal and external factors are thought to affect the length of study [1]. Therefore, this research aims to determine the factors that affect the length of study for undergraduate students of the 2017 class at FMIPA UNIB.
The factors that affect students' length of study are examined in this study using survival analysis. Survival analysis is a statistical method used when the data concern the time until a certain event occurs. This study uses the Cox proportional hazard and random survival forest methods because both can handle censored data. Random survival forest is an ensemble of random trees for right-censored survival time data. A tied event in this study occurs when 2 or more students have their thesis defense in the same month. An observation is censored if the student's thesis defense takes place after more than 48 months of study. This study is therefore entitled "Survival Analysis of Students Not Graduated on Time Using Cox Proportional Hazard Regression Method and Random Survival Forest Method".

Survival Analysis
Survival analysis is a statistical method for time-to-event data, where time is measured from a time origin (start point) to a particular event (failure event or end point). Cox regression is a survival analysis method for data whose dependent variable is a survival time. Survival time is the time from the start of the study until the event occurs [2].

Censored Data
An observation is said to be censored if the individual has not experienced the event by the end of the observation period. If the individual experiences the event before the end of the observation period, the observation is uncensored [3].
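The censoring rule can be illustrated with a minimal Python sketch. It assumes the 48-month limit stated in the introduction; the function name, the input format, and the choice to record censored students at exactly 48 months are assumptions made for illustration, not the authors' data preparation.

```python
CENSOR_MONTHS = 48  # on-time limit taken from the text (8 semesters)


def to_survival_record(months_to_defense):
    """Return (observed_time, event) for one student.

    months_to_defense: months from enrollment to thesis defense,
    or None if the student has not defended yet (hypothetical input format).
    event = 1 means the defense was observed within 48 months (uncensored),
    event = 0 means the observation is censored at 48 months.
    """
    if months_to_defense is not None and months_to_defense <= CENSOR_MONTHS:
        return months_to_defense, 1
    return CENSOR_MONTHS, 0


# Made-up example values: two on-time students, one late, one not yet defended.
records = [to_survival_record(m) for m in (44, 48, 53, None)]
print(records)
```

Each record is a pair of observed time and event indicator, which is the input format that both the Cox model and the random survival forest expect.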

Probability Density Function
If $T$ is a random variable denoting the lifetime of an individual on the interval $[0, \infty)$, then its probability density function is $f(t)$ and its cumulative distribution function is $F(t)$. The probability density function of the survival time $T$ is defined as the probability that an individual fails in the time interval $t$ to $t + \Delta t$, per unit time. This can be expressed as [4]:
$$f(t) = \lim_{\Delta t \to 0} \frac{P(t \le T < t + \Delta t)}{\Delta t}$$
while the cumulative distribution function is:
$$F(t) = P(T \le t) = \int_0^t f(x)\,dx$$

Survival Function
If $T$ is a random variable on the interval $[0, \infty)$ denoting the time at which an individual in the population experiences the event, and $f(t)$ is the probability density function of $T$, then the probability that an individual has not experienced the event up to time $t$ is given by the survival function $S(t)$ [4]:
$$S(t) = P(T > t) = 1 - F(t)$$
The relationship between the probability density function, the cumulative distribution function of $T$, and the survival function is $f(t) = F'(t) = -S'(t)$.

Hazard Function
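Suppose $T$ is a random variable on the interval $[0, \infty)$ denoting the time at which an individual in a population experiences the event. The rate at which an individual who has not yet experienced the event at time $t$ experiences it in the interval $(t, t + \Delta t)$ is given by the hazard function $h(t)$ [4]:
$$h(t) = \lim_{\Delta t \to 0} \frac{P(t \le T < t + \Delta t \mid T \ge t)}{\Delta t} = \frac{f(t)}{S(t)}$$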

Cumulative Hazard Function
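The cumulative hazard function $H(t)$ accumulates the hazard up to time $t$. The following is the result of substituting Equation (1) into Equation (2) [4]:
$$H(t) = \int_0^t h(x)\,dx = -\ln S(t)$$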
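The relationships among the density, survival, hazard, and cumulative hazard functions can be checked numerically. The sketch below assumes a Weibull survival time purely for illustration (the study itself does not assume a parametric distribution); the parameter values and function names are hypothetical.

```python
import math

SHAPE, SCALE = 2.0, 50.0  # hypothetical Weibull parameters, for illustration only


def survival(t):
    """S(t) = P(T > t) for the assumed Weibull distribution."""
    return math.exp(-((t / SCALE) ** SHAPE))


def density(t):
    """f(t) = -S'(t)."""
    return (SHAPE / SCALE) * (t / SCALE) ** (SHAPE - 1) * survival(t)


def hazard(t):
    """h(t) = f(t) / S(t)."""
    return density(t) / survival(t)


def cumulative_hazard(t, steps=10_000):
    """H(t) as the integral of h from 0 to t, using the trapezoidal rule."""
    width = t / steps
    total = 0.5 * (hazard(0.0) + hazard(t))
    total += sum(hazard(k * width) for k in range(1, steps))
    return total * width


t = 48.0
H_numeric = cumulative_hazard(t)     # integral of the hazard
H_closed = -math.log(survival(t))    # -ln S(t)
print(H_numeric, H_closed)
```

The two printed values agree up to numerical error, which is the identity $H(t) = -\ln S(t)$.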

Cox Proportional Hazards Regression
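Cox proportional hazards regression, also known as the Cox regression model, is used to determine the relationship between the dependent variable and the independent variables, where the dependent variable is the survival time of an individual. The Cox proportional hazards regression model is as follows [5]:
$$h(t, X) = h_0(t) \exp(\beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p) \tag{3}$$
where $h_0(t)$ is the baseline hazard function, $X_1, \dots, X_p$ are the independent variables, and $\beta_1, \dots, \beta_p$ are the regression coefficients.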

Tied Event Data
Tied event data are often found in survival analyses. A tied event occurs when two or more individuals experience the event at the same time.

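CPH Model Parameter Estimation for Tied Events
The methods offered by [5] to handle tied event data are the Breslow partial likelihood method, the Efron partial likelihood method, and the exact partial likelihood method. The partial likelihood for the Breslow method is:
$$L(\beta) = \prod_{j=1}^{r} \frac{\exp(\beta' s_j)}{\left[ \sum_{l \in R(t_j)} \exp(\beta' x_l) \right]^{d_j}}$$
where $t_1, \dots, t_r$ are the distinct event times, $d_j$ is the number of events at $t_j$, $s_j$ is the sum of the covariate vectors of the $d_j$ individuals who experience the event at $t_j$, and $R(t_j)$ is the set of individuals at risk at $t_j$.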
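The Breslow approach can be sketched in plain Python as a function that evaluates the log of the partial likelihood for a given coefficient vector. This is a minimal illustration under the standard Breslow definition, not the software used in the study; the function name and the toy data are hypothetical.

```python
import math
from collections import defaultdict


def breslow_log_partial_likelihood(beta, times, events, X):
    """Breslow log partial likelihood.

    For each distinct event time t_j with d_j tied events:
        sum of linear predictors of the tied individuals
        - d_j * log(sum of exp(linear predictor) over the risk set R(t_j))
    """
    eta = [sum(b * x for b, x in zip(beta, row)) for row in X]

    # Group the indices of individuals with an observed event by event time.
    tied = defaultdict(list)
    for i, (t, e) in enumerate(zip(times, events)):
        if e == 1:
            tied[t].append(i)

    loglik = 0.0
    for t_j, idx in tied.items():
        d_j = len(idx)
        # Risk set: everyone whose observed time is at least t_j.
        risk_sum = sum(math.exp(eta[l]) for l, t in enumerate(times) if t >= t_j)
        loglik += sum(eta[i] for i in idx) - d_j * math.log(risk_sum)
    return loglik


# Toy data (made up): two students tied at month 40, one censored at month 48.
times = [40, 40, 44, 48]
events = [1, 1, 1, 0]
X = [[3.7], [3.5], [3.3], [3.0]]  # a single covariate, e.g. GPA
ll_at_zero = breslow_log_partial_likelihood([0.0], times, events, X)
print(ll_at_zero)
```

At $\beta = 0$ every individual has the same weight, so the value depends only on the risk-set sizes; maximizing this function over $\beta$ gives the Breslow estimate.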

Parameter Test
There are three ways to test the significance of the parameters: the partial likelihood ratio test, the score test, and the Wald test. Parameter significance testing checks whether the independent variables have a significant effect in the model. In this study, the simultaneous test uses the partial likelihood ratio test, while the partial test uses the score test [6].

Assumptions of the Cox Proportional Hazard Model
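There are two ways to test the proportional hazard assumption in a Cox proportional hazard model: a graphical approach using a log-minus-log survival plot, and a Schoenfeld residual plot. Schoenfeld residuals are defined for each individual and each independent variable, and are based on the first derivative of the log partial likelihood function [3]. The Schoenfeld residual for the $i$-th individual on the $j$-th independent variable is as follows:
$$r_{ij} = \delta_i \left( x_{ij} - \frac{\sum_{l \in R(t_i)} x_{lj} \exp(\hat{\beta}' x_l)}{\sum_{l \in R(t_i)} \exp(\hat{\beta}' x_l)} \right), \quad j = 1, 2, \dots, p$$
where $\delta_i$ is the event indicator of individual $i$, $R(t_i)$ is the set of individuals at risk at time $t_i$, and $\hat{\beta}$ is the vector of estimated coefficients.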
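For a single independent variable, the Schoenfeld residual at an event time is the observed covariate value minus its risk-weighted average over the individuals still at risk. The following Python sketch assumes one covariate and a given coefficient; the function name and the toy values are hypothetical.

```python
import math


def schoenfeld_residuals(times, events, x, beta):
    """Schoenfeld residuals for one covariate.

    Returns a list of (event_time, residual) pairs, one per observed event.
    The residual is x_i minus the average of x over the risk set at t_i,
    weighted by exp(beta * x).
    """
    residuals = []
    for i, (t_i, e_i) in enumerate(zip(times, events)):
        if e_i != 1:
            continue  # residuals are defined only for observed events
        risk = [l for l, t in enumerate(times) if t >= t_i]
        weights = [math.exp(beta * x[l]) for l in risk]
        weighted_mean = sum(w * x[l] for w, l in zip(weights, risk)) / sum(weights)
        residuals.append((t_i, x[i] - weighted_mean))
    return residuals


# Toy data (made up): GPA as the covariate, last student censored.
res = schoenfeld_residuals(times=[40, 44, 48], events=[1, 1, 0],
                           x=[3.7, 3.5, 3.0], beta=0.0)
print(res)
```

Plotting these residuals against time and checking that the trend is flat is the graphical check of the proportional hazard assumption described above.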

Random Survival Forests
Random survival forest is a machine learning method for survival analysis that can make predictions involving many independent variables and can handle large amounts of data. It is an ensemble of random trees for right-censored survival data. The method relies only on the data and not on model assumptions, so it is considered to predict survival and select variables better [7].

Splitting
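The log-rank split rule divides a node by maximizing the log-rank test statistic. Suppose that node $h$ of a growing tree is to be split into two child nodes, and that node $h$ contains $n$ observations with survival times and censoring indicators $(T_1, \delta_1), \dots, (T_n, \delta_n)$, where observation $i$ is censored at time $T_i$ if $\delta_i = 0$ and uncensored at time $T_i$ if $\delta_i = 1$. The log-rank test statistic for a split on the independent variable $x$ at the value $c$ is:
$$L(x, c) = \frac{\sum_{k=1}^{N} \left( d_{k,1} - Y_{k,1} \frac{d_k}{Y_k} \right)}{\sqrt{\sum_{k=1}^{N} \frac{Y_{k,1}}{Y_k} \left( 1 - \frac{Y_{k,1}}{Y_k} \right) \left( \frac{Y_k - d_k}{Y_k - 1} \right) d_k}}$$
where $t_1, \dots, t_N$ are the distinct event times in node $h$, $d_k$ and $Y_k$ are the number of events and the number of individuals at risk at $t_k$ in node $h$, and $d_{k,1}$ and $Y_{k,1}$ are the corresponding counts in the child node with $x \le c$.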
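A minimal Python sketch of the log-rank split statistic is given below. It follows the standard two-sample log-rank form, with the child node defined by $x \le c$; the function name and the toy data are hypothetical, and a real forest would evaluate this over many candidate variables and split points.

```python
import math


def logrank_split_statistic(times, events, x, c):
    """Absolute log-rank statistic for splitting a node into x <= c and x > c."""
    left = [xi <= c for xi in x]
    event_times = sorted({t for t, e in zip(times, events) if e == 1})

    numerator, variance = 0.0, 0.0
    for t in event_times:
        at_risk = sum(1 for ti in times if ti >= t)
        at_risk_left = sum(1 for ti, l in zip(times, left) if ti >= t and l)
        deaths = sum(1 for ti, e in zip(times, events) if ti == t and e == 1)
        deaths_left = sum(1 for ti, e, l in zip(times, events, left)
                          if ti == t and e == 1 and l)

        # Observed minus expected events in the left child.
        numerator += deaths_left - at_risk_left * deaths / at_risk
        if at_risk > 1:
            p = at_risk_left / at_risk
            variance += p * (1 - p) * ((at_risk - deaths) / (at_risk - 1)) * deaths

    return abs(numerator) / math.sqrt(variance) if variance > 0 else 0.0


# Toy node (made up): higher GPA goes with earlier graduation.
times = [40, 42, 44, 46]
events = [1, 1, 1, 1]
gpa = [3.8, 3.7, 3.2, 3.1]
stat = logrank_split_statistic(times, events, gpa, c=3.5)
print(stat)
```

The split point with the largest statistic separates the two child nodes most strongly in terms of survival, which is why the rule maximizes it.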

Bootstrap
Bootstrap is a nonparametric resampling technique that works without distributional assumptions because the original sample is treated as the population. The bootstrap method draws samples with replacement from the original dataset to obtain new data. Each bootstrap sample has the same size as the original data, and the resampling is repeated B times. The original sample is the initial sample obtained from the observations, which is treated as the population [8].
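The resampling step can be sketched with the Python standard library as follows. The function name, the seed, and the toy records are assumptions for illustration.

```python
import random


def bootstrap_samples(data, B, seed=0):
    """Draw B bootstrap samples, each of the same size as data, with replacement."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(data)
    return [[data[rng.randrange(n)] for _ in range(n)] for _ in range(B)]


# Toy survival records (made up): (months, event indicator).
data = [(40, 1), (44, 1), (46, 1), (48, 0), (48, 0)]
samples = bootstrap_samples(data, B=3)
for s in samples:
    print(s)
```

Because sampling is with replacement, some records appear more than once in a bootstrap sample and others not at all; in a random survival forest, one survival tree is grown from each such sample.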

Variable Importance
Independent variables are selected by filtering on variable importance. The representative method for this selection is permutation importance. Permutation importance measures the importance of a variable $x$ as the average, over the trees, of the difference between the prediction error after permuting $x$ and the prediction error before permuting it. A variable with a positive importance value has good predictive ability, while a variable with an importance value of zero or below is non-predictive [9].
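The idea of permutation importance can be sketched in Python as below. This is a simplified illustration for a single fitted model with a user-supplied error function, whereas the forest averages the same difference over its trees; the function names, the toy rows, and the toy error function are all hypothetical.

```python
import random


def permutation_importance(error_fn, rows, col, seed=0):
    """Prediction error after permuting column col minus the error before."""
    rng = random.Random(seed)
    baseline = error_fn(rows)

    # Shuffle only the chosen column, leaving all other columns in place.
    shuffled = [r[col] for r in rows]
    rng.shuffle(shuffled)
    permuted = [r[:col] + [v] + r[col + 1:] for r, v in zip(rows, shuffled)]

    return error_fn(permuted) - baseline


# Toy rows (made up): [x0, x1, y], where y depends on x0 only.
rows = [[3.1, 0, 6.2], [3.4, 1, 6.8], [3.7, 0, 7.4], [3.9, 1, 7.8]]


def toy_error(data):
    """Mean absolute error of a hypothetical model that predicts y = 2 * x0."""
    return sum(abs(y - 2 * x0) for x0, _, y in data) / len(data)


imp_x0 = permutation_importance(toy_error, rows, col=0)
imp_x1 = permutation_importance(toy_error, rows, col=1)
print(imp_x0, imp_x1)
```

Permuting a variable that the model does not use leaves the error unchanged, so its importance is zero; permuting a variable that the model relies on cannot reduce the error in this toy setting.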

Comparing the Cox Proportional Hazard Method with the Random Survival Forest Method
The Cox proportional hazard and random survival forest models can be compared through their prediction errors, which are calculated using Harrell's concordance index (C-Index). The C-Index is a popular tool for assessing prediction accuracy in survival analysis. The two methods are compared through their error values, where the error is 1 - C-Index, so a larger C-Index means a smaller error and better prediction accuracy [7].
Harrell's concordance index for assessing prediction accuracy in survival analysis is:
$$C = \frac{\sum_{i \ne j} \delta_i \, I(T_i < T_j) \, I(\hat{\eta}_i > \hat{\eta}_j)}{\sum_{i \ne j} \delta_i \, I(T_i < T_j)}$$
where $T_i$ is the observed time, $\delta_i$ is the event indicator, $\hat{\eta}_i$ is the predicted risk of individual $i$, and $I(\cdot)$ is the indicator function.
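Harrell's concordance index can be computed by counting comparable pairs, as in the Python sketch below. A pair is comparable when the individual with the shorter time has an observed event, and it is concordant when that individual also has the higher predicted risk. Counting tied predictions as one half is a common convention and is an assumption here; the function name and the toy data are hypothetical.

```python
def harrell_c_index(times, events, risk):
    """Harrell's C: the share of comparable pairs ordered correctly by risk."""
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        if events[i] != 1:
            continue  # the earlier member of a pair must have an observed event
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data (made up): higher risk score means earlier graduation.
times = [40, 44, 48, 48]
events = [1, 1, 1, 0]
risk = [0.9, 0.7, 0.2, 0.1]
c_index = harrell_c_index(times, events, risk)
error = 1 - c_index
print(c_index, error)
```

The prediction error used to compare the two methods is then 1 minus this value.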

Length of Study
The length of study is the time students need to complete their education, from enrollment to graduation. The quality of a graduate is expected to serve as capital for finding work that matches the graduate's expertise. Graduate quality is influenced by several factors, namely internal and external factors [10]. Both kinds of factors strongly influence how a student pursues his or her education. Internal factors come from within the student, such as intelligence, emotion, psychological state, and others. External factors come from outside the individual, such as the family environment, community environment, campus environment, educational infrastructure provided by the campus, and the learning motivation given to the student [11].

METHOD
The data in this study are nominal and numerical. Nominal data are categorical data. The nominal data in this study are the factors that affect students' length of study: gender ($X_2$), parents' occupation ($X_3$), regional origin ($X_4$), university entrance pathway ($X_5$), scholarships ($X_6$), and part-time work ($X_7$). The numerical data in this study is the GPA ($X_1$). The data used are primary and secondary data. Primary data were obtained through interviews conducted via electronic media (telephone and WhatsApp) and by distributing Google Forms. The secondary data are the length-of-study data for students of the 2017 class at FMIPA UNIB.
The stages of the research are as follows:
1. Data exploration
2. Cox proportional hazard regression analysis
   a. Parameter estimation
   b. Parameter testing with the simultaneous test and the partial test
   c. Testing the proportional hazard assumption with the Schoenfeld residual plot
   d. Interpretation of the Cox proportional hazard model
3. Random survival forest analysis
   a. Draw bootstrap samples from the original survival data
   b. Grow a survival tree from each bootstrap sample
   c. Select independent variables at random at each tree node
   d. Split tree nodes using the selected independent variables
   e. Calculate the ensemble cumulative hazard function (CHF)
   f. Select variables by variable importance
4. Compare the CPH method with the RSF method using the C-Index, choosing the smallest error value
5. Determine the best method

Parameter Estimation
The parameter estimates of the CPH model with the Breslow approach are given in Table 1.

Parameter Test
To find out whether the variables in Equation (4) have an effect on the model, the parameters are tested simultaneously with the partial likelihood ratio test, where the null hypothesis $H_0: \beta_1 = \beta_2 = \dots = \beta_p = 0$ is tested against the alternative that there is at least one independent variable that has an effect on the model.
The results of the partial parameter tests using the score test are given in Table 2.

Testing the Proportional Hazard Assumption
In the Schoenfeld residual plot, if the slope of the curve is close to zero, the coefficient $\beta_1$ is constant over time, so the proportional hazard assumption is met. Figure 1 shows the Schoenfeld residual plot for the GPA variable.

Interpretation of the Cox Proportional Hazard Model
The final Cox proportional hazard model is as follows:
$$h(t, X) = h_0(t) \exp(3.8307 X_1) \tag{5}$$
Equation (5) gives the value of $\beta_1$, which shows the effect of the independent variable on the hazard function. In this study, the GPA variable is numerical, so the hazard ratio is obtained by comparing two values within the range of GPA values. Comparing a cum laude GPA (GPA 3.7) with a non-cum laude GPA (GPA 3.32) gives
$$HR = \frac{h(t, X_1 = 3.7)}{h(t, X_1 = 3.32)} = \exp(3.8307 \times (3.7 - 3.32)) \approx 4.287$$
Based on this calculation, students with a GPA of 3.7 graduate on time at 4.287 times the rate of students with a GPA of 3.32.
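The hazard ratio calculation can be reproduced with a few lines of Python. The coefficient 3.8307 and the two GPA values are taken from the text; the function and variable names are chosen for illustration.

```python
import math

BETA_GPA = 3.8307  # estimated GPA coefficient from the final model


def hazard_ratio(gpa_a, gpa_b, beta=BETA_GPA):
    """Hazard ratio of GPA gpa_a relative to gpa_b; the baseline hazard cancels."""
    return math.exp(beta * (gpa_a - gpa_b))


hr = hazard_ratio(3.7, 3.32)
print(round(hr, 3))  # 4.287
```

Because the baseline hazard $h_0(t)$ appears in both the numerator and the denominator, the ratio depends only on the difference in GPA and not on time.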

Random Survival Forest
The first step in the random survival forest method is to divide the data into 75% training data and 25% testing data. The results of data processing using the random survival forest method are given in Table 3. The comparison of methods using Harrell's concordance index shows that the random survival forest method is more suitable for the length-of-study data of undergraduate students of the 2017 class at FMIPA UNIB, because its C-Index error value of 26.9% is smaller than the 27.8% of the Cox proportional hazard method.

Figure 1. Schoenfeld Residual Plot for the GPA Variable

Table 1. Parameter Estimation of CPH Model with Breslow Approach
From these estimates, the estimated Cox proportional hazard model with the Breslow approach is obtained as Equation (4).

Table 2. Partial Parameter Test Results with Score Test

Table 3. Results of RSF Method Data Processing