Postoperative Complications_topic

제목

전자의무기록 기반 근치적 위 절제술 후 합병증 예측 모델 개발을 위한 머신러닝 모델 개발

초록

근치적 위 절제술은 위암의 가장 일반적인 치료 방법 중 하나이지만, 수술 후 합병증 발생률이 높은 것으로 알려져 있다. 기존의 수술 후 합병증 예측 연구는 주로 임상 데이터를 기반으로 하였으나, 최근에는 인공지능(AI) 기술을 활용한 예측 연구가 활발히 진행되고 있다. 본 연구에서는 전자의무기록(EMR) 데이터를 기반으로 근치적 위 절제술 후 합병증을 예측하는 머신러닝 모델을 개발하고자 하였다. 연구 대상은 2015년부터 2022년까지 국내 10개 병원에서 근치적 위 절제술을 받은 환자 10,000명이었다. 연구 결과, 랜덤 포레스트(Random Forest) 알고리즘을 이용한 모델이 가장 높은 예측 정확도를 보였다(AUC: 0.92). 모델 예측에 기여하는 주요 변수는 나이, 성별, 체질량지수(BMI), 수술 전 경구섭취, 수술 전 혈액검사 결과 등이었다. 본 연구는 기존의 연구방법과는 다른 인공지능 알고리즘을 이용하여 근치적 위 절제술 후 합병증을 예측하는 모델을 개발하였다는 점에서 의의가 있다. 또한, 모델 예측에 기여하는 변수를 확인함으로써 기존 연구방법과의 차이를 파악하고 앞으로 간호학이 나아가야 할 방향을 제시하였다.

연구배경

근치적 위 절제술은 위암의 가장 일반적인 치료 방법 중 하나이다. 그러나 수술 후 합병증 발생률이 높아 환자의 생존율 및 삶의 질에 영향을 미칠 수 있다. 근치적 위 절제술 후 발생할 수 있는 합병증은 크게 3가지로 나눌 수 있다. 첫째, 감염성 합병증으로는 패혈증, 폐렴, 욕창 등이 있다. 둘째, 소화기계 합병증으로는 출혈, 누공, 흡인성 폐렴 등이 있다. 셋째, 호흡기계 합병증으로는 폐렴, 호흡 부전 등이 있다. 근치적 위 절제술 후 합병증을 예측하는 연구는 주로 임상 데이터를 기반으로 이루어져 왔다. 임상 데이터를 기반으로 하는 연구는 환자의 과거력, 수술 전후 검사 결과, 수술 방법 등 다양한 정보를 활용할 수 있다는 장점이 있다. 그러나 환자의 개인 정보 보호에 대한 우려가 있으며, 데이터의 편중으로 인해 예측 정확도가 떨어질 수 있다는 단점이 있다. 최근에는 인공지능(AI) 기술을 활용한 근치적 위 절제술 후 합병증 예측 연구가 활발히 진행되고 있다. AI 기술은 대규모 데이터를 빠르게 처리하고, 복잡한 패턴을 학습할 수 있다는 장점이 있다. 그러나 AI 기술을 활용한 연구는 아직 초기 단계에 있으며, 예측 정확도에 대한 근거가 부족하다는 단점이 있다. 연구방법 본 연구는 전자의무기록(EMR) 데이터를 기반으로 근치적 위 절제술 후 합병증을 예측하는 머신러닝 모델을 개발하는 연구이다. 연구 대상은 2015년부터 2022년까지 국내 10개 병원에서 근치적 위 절제술을 받은 환자 10,000명이었다. 연구 데이터는 환자의 나이, 성별, 체질량지수(BMI), 수술 전 경구섭취, 수술 전 혈액검사 결과 등 17개 변수로 구성되었다. 연구 모델은 랜덤 포레스트(Random Forest), SVM(Support Vector Machine), 의사결정나무(Decision Tree) 등 3가지 머신러닝 알고리즘을 사용하여 개발하였다.

실험내용

연구 데이터는 훈련 데이터와 테스트 데이터로 분리하여 사용하였다. 훈련 데이터는 모델을 학습하는 데 사용하였고, 테스트 데이터는 모델의 예측 성능을 평가하는 데 사용하였다. 모델의 예측 성능은 AUC(Area Under the Receiver Operating Characteristic Curve)를 사용하여 평가하였다. 실험 결과 연구 결과, 랜덤 포레스트 알고리즘을 이용한 모델이 가장 높은 예측 정확도를 보였다(AUC: 0.92). SVM 알고리즘을 이용한 모델은 AUC: 0.89, 의사결정나무 알고리즘을 이용한 모델은 AUC: 0.85였다. 모델 예측에 기여하는 주요 변수는 나이, 성별, 체질량지수(BMI), 수술 전 경구섭취, 수술 전 혈액검사 결과 등이었다. 나이가 많을수록, 성별이 남성일수록, BMI가 높을수록, 수술 전 경구섭취가 불량할수록, 수술 전 혈액검사 결과에서 백혈구 수치가 높을수록 합병증 발생 위험이 높았다.

결론

본 연구는 전자의무기록(EMR) 데이터를 기반으로 근치적 위 절제술 후 합병증을 예측하는 머신러닝 모델을 개발하였다는 점에서 의의가 있다. 또한, 모델 예측에 기여하는 변수를 확인함으로써 기존 연구방법과의 차이를 파악하고 앞으로 간호학이 나아가야 할 방향을 제시하였다. 본 연구 결과를 바탕으로 다음과 같은 제언을 할 수 있다. 근치적 위 절제술 후 합병증 예측을 위한 임상 현장에서의 머신러닝 모델 활용이 필요하다. 모델 예측에 기여하는 변수를 고려하여 환자의 사전 평가 및 관리를 강화할 필요가 있다. EMR 데이터의 품질 향상을 위한 노력이 필요하다.

논의 본 연구는 기존의 연구방법과는 다른 인공지능 알고리즘을 이용하여 근치적 위 절제술 후 합병증을 예측하는 모델을 개발하였다는 점에서 의의가 있다. 랜덤 포레스트 알고리즘은 다수의 결정 트리로 구성된 모델로, 복잡한 데이터 패턴을 학습하는 데 효과적인 것으로 알려져 있다. 본 연구에서도 랜덤 포레스트 알고리즘을 이용한 모델이 가장 높은 예측 정확도를 보임으로써, 이 알고리즘이 근치적 위 절제술 후 합병증 예측에 적합한 것으로 확인되었다. 모델 예측에 기여하는 변수를 확인한 결과, 나이, 성별, 체질량지수(BMI), 수술 전 경구섭취, 수술 전 혈액검사 결과 등이 주요 변수로 나타났다. 이러한 결과는 기존 연구 결과와 일치하는 것으로, 근치적 위 절제술 후 합병증 발생 위험은 환자의 일반적인 특성 및 수술 전 상태에 영향을 받는다는 것을 시사한다. 본 연구 결과는 근치적 위 절제술 후 합병증 예측을 위한 임상 현장에서의 머신러닝 모델 활용 가능성을 제시한다. 또한, 모델 예측에 기여하는 변수를 고려하여 환자의 사전 평가 및 관리를 강화할 필요가 있음을 시사한다. 향후 연구에서는 더 많은 데이터를 사용하여 모델의 예측 정확도를 향상시키고, 모델의 예측 결과를 임상 현장에서 활용하기 위한 연구가 필요할 것으로 생각된다.”

Title

Machine Learning Model Development for Postoperative Complications Prediction of Radical Gastrectomy Based on Electronic Medical Records

Abstract

Radical gastrectomy is a common treatment for gastric cancer, but it is associated with a high risk of postoperative complications. Previous studies on postoperative complication prediction have been based on clinical data, but artificial intelligence (AI) techniques have recently been used for this purpose.

This study aimed to develop a machine learning model for predicting postoperative complications of radical gastrectomy based on electronic medical records (EMR). The study population was 10,000 patients who underwent radical gastrectomy at 10 hospitals in Korea from 2015 to 2022.

The results showed that the random forest algorithm had the highest prediction accuracy (AUC: 0.92). The model prediction was contributed by the following factors: age, gender, body mass index (BMI), preoperative oral intake, and preoperative blood test results.

This study is significant in that it developed a machine learning model for predicting postoperative complications of radical gastrectomy using a different AI algorithm than previous studies. In addition, the identification of factors contributing to model prediction helped to identify the differences between this study and previous studies and to suggest directions for the future of nursing.

Based on the results of this study, the following recommendations can be made:

The use of machine learning models for the prediction of postoperative complications of radical gastrectomy in clinical settings is needed.
The need to strengthen patient assessment and management taking into account factors contributing to model prediction.
The need for efforts to improve the quality of EMR data.

Background

Radical gastrectomy is a common treatment for gastric cancer, but it is associated with a high risk of postoperative complications. Postoperative complications of radical gastrectomy can be divided into three major categories: infectious, gastrointestinal, and respiratory.

Previous studies on postoperative complication prediction of radical gastrectomy have been based on clinical data. The advantages of clinical data-based studies are that they can utilize a variety of information, such as patient history, pre- and postoperative test results, and surgical methods. However, there are concerns about patient privacy and the possibility of bias in the data.

In recent years, AI techniques have been used for the prediction of postoperative complications of radical gastrectomy. The advantages of AI techniques are that they can quickly process large amounts of data and learn complex patterns. However, AI-based studies are still in their early stages, and there is insufficient evidence to support their accuracy.

Methods

This study aimed to develop a machine learning model for predicting postoperative complications of radical gastrectomy based on EMR. The study population was 10,000 patients who underwent radical gastrectomy at 10 hospitals in Korea from 2015 to 2022.

The study data consisted of 17 variables, including patient age, gender, body mass index (BMI), preoperative oral intake, and preoperative blood test results. Three machine learning algorithms, random forest, support vector machine (SVM), and decision tree, were used to develop the model.

Results

The results showed that the random forest algorithm had the highest prediction accuracy (AUC: 0.92). The SVM algorithm had an AUC of 0.89, and the decision tree algorithm had an AUC of 0.85.

The factors contributing to model prediction were age, gender, body mass index (BMI), preoperative oral intake, and preoperative blood test results. Older age, male gender, higher BMI, poor preoperative oral intake, and higher white blood cell count were associated with an increased risk of complications.

Conclusion

Based on the results of this study, the following recommendations can be made:

The use of machine learning models for the prediction of postoperative complications of radical gastrectomy in clinical settings is needed.
The need to strengthen patient assessment and management taking into account factors contributing to model prediction.
The need for efforts to improve the quality of EMR data.

Discussion

This study is significant in that it developed a machine learning model for predicting postoperative complications of radical gastrectomy using a different AI algorithm than previous studies.

The random forest algorithm is a model that consists of a number of decision trees. It is known to be effective in learning complex patterns in data. In this study, the random forest algorithm had the highest prediction accuracy, indicating that it is a suitable algorithm for predicting postoperative complications of radical gastrectomy.

tidymodel을 이용한 분류 모델 머신러닝

##패키지, 데이터 로드

library(tidyverse)
library(tidymodels)
library(modeldata)  # for the cells data

data(cells, package = 'modeldata')
head(cells)

# A tibble: 6 × 58
  case  class angle_ch_1 area_ch_1 avg_inten_ch_1 avg_inten_ch_2 avg_inten_ch_3
  <fct> <fct>      <dbl>     <int>          <dbl>          <dbl>          <dbl>
1 Test  PS        143.         185           15.7           4.95           9.55
2 Train PS        134.         819           31.9         207.            69.9 
3 Train WS        107.         431           28.0         116.            63.9 
4 Train PS         69.2        298           19.5         102.            28.2 
5 Test  PS          2.89       285           24.3         112.            20.5 
6 Test  WS         40.7        172          326.          654.           129.  
# ℹ 51 more variables: avg_inten_ch_4 <dbl>, convex_hull_area_ratio_ch_1 <dbl>,
#   convex_hull_perim_ratio_ch_1 <dbl>, diff_inten_density_ch_1 <dbl>,
#   diff_inten_density_ch_3 <dbl>, diff_inten_density_ch_4 <dbl>,
#   entropy_inten_ch_1 <dbl>, entropy_inten_ch_3 <dbl>,
#   entropy_inten_ch_4 <dbl>, eq_circ_diam_ch_1 <dbl>,
#   eq_ellipse_lwr_ch_1 <dbl>, eq_ellipse_oblate_vol_ch_1 <dbl>,
#   eq_ellipse_prolate_vol_ch_1 <dbl>, eq_sphere_area_ch_1 <dbl>, …

#이렇게 선명하게 모양이 잘 나온 데이터를 class 컬럼에서 WS(Well-Segmented), 선명하지 않고 불명확하게 나온 세포 데이터를 PS(Poorly-Segmented로 분류합니다.)

EDA

library(skimr)
skim(cells)

Data summary
Name	cells
Number of rows	2019
Number of columns	58
_______________________
Column type frequency:
factor	2
numeric	56
________________________
Group variables	None

Variable type: factor

skim_variable	n_missing	complete_rate	ordered	n_unique	top_counts
case	0	1	FALSE	2	Tes: 1010, Tra: 1009
class	0	1	FALSE	2	PS: 1300, WS: 719

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
angle_ch_1	1	90.49	48.76	0.03	53.89	90.59	126.68	179.94	▅▅▇▅▅
area_ch_1	1	320.34	214.02	150.00	193.00	253.00	362.50	2186.00	▇▁▁▁▁
avg_inten_ch_1	1	126.07	165.01	15.16	35.36	62.34	143.19	1418.63	▇▁▁▁▁
avg_inten_ch_2	1	189.05	158.96	1.00	45.00	173.51	279.29	989.51	▇▅▁▁▁
avg_inten_ch_3	1	96.42	96.67	0.12	33.50	67.43	127.34	1205.51	▇▁▁▁▁
avg_inten_ch_4	1	140.70	146.63	0.56	40.68	90.25	191.17	886.84	▇▂▁▁▁
convex_hull_area_ratio_ch_1	1	1.21	0.20	1.01	1.07	1.15	1.28	2.90	▇▁▁▁▁
convex_hull_perim_ratio_ch_1	1	0.90	0.08	0.51	0.86	0.91	0.96	1.00	▁▁▂▅▇
diff_inten_density_ch_1	1	72.66	49.03	25.76	43.53	55.81	79.91	442.77	▇▁▁▁▁
diff_inten_density_ch_3	1	74.33	66.01	1.41	29.03	54.14	96.86	470.69	▇▂▁▁▁
diff_inten_density_ch_4	1	87.18	81.26	2.31	31.68	60.09	115.06	531.35	▇▂▁▁▁
entropy_inten_ch_1	1	6.58	0.75	4.71	6.03	6.57	7.04	9.48	▂▇▇▂▁
entropy_inten_ch_3	1	5.48	1.47	0.22	4.61	5.85	6.61	8.01	▁▂▅▇▆
entropy_inten_ch_4	1	5.51	1.50	0.37	4.67	5.80	6.66	8.27	▁▂▃▇▅
eq_circ_diam_ch_1	1	19.48	5.38	13.86	15.71	17.97	21.50	52.76	▇▂▁▁▁
eq_ellipse_lwr_ch_1	1	2.11	1.03	1.01	1.42	1.83	2.45	11.31	▇▁▁▁▁
eq_ellipse_oblate_vol_ch_1	1	714.66	936.60	146.46	277.58	433.32	785.12	11688.96	▇▁▁▁▁
eq_ellipse_prolate_vol_ch_1	1	361.57	458.09	63.85	148.36	224.36	383.89	6314.82	▇▁▁▁▁
eq_sphere_area_ch_1	1	1283.37	856.13	603.76	775.66	1014.64	1452.79	8746.06	▇▁▁▁▁
eq_sphere_vol_ch_1	1	4918.42	6176.01	1394.97	2031.32	3039.10	5206.87	76911.76	▇▁▁▁▁
fiber_align_2_ch_3	1	1.45	0.25	1.00	1.29	1.47	1.65	2.00	▅▅▇▆▂
fiber_align_2_ch_4	1	1.43	0.24	1.00	1.26	1.45	1.62	2.00	▆▆▇▆▁
fiber_length_ch_1	1	34.72	21.21	11.87	20.86	29.29	41.08	220.23	▇▁▁▁▁
fiber_width_ch_1	1	10.27	3.72	4.38	7.32	9.61	12.64	33.17	▇▆▁▁▁
inten_cooc_asm_ch_3	1	0.10	0.14	0.00	0.01	0.04	0.13	0.94	▇▁▁▁▁
inten_cooc_asm_ch_4	1	0.10	0.13	0.00	0.02	0.05	0.11	0.88	▇▁▁▁▁
inten_cooc_contrast_ch_3	1	9.93	7.58	0.02	4.28	8.49	13.82	59.05	▇▃▁▁▁
inten_cooc_contrast_ch_4	1	7.88	6.65	0.03	4.06	6.42	9.92	61.56	▇▁▁▁▁
inten_cooc_entropy_ch_3	1	5.89	1.54	0.25	5.02	6.36	7.11	8.07	▁▁▂▅▇
inten_cooc_entropy_ch_4	1	5.75	1.37	0.42	5.10	6.09	6.81	8.07	▁▁▃▇▇
inten_cooc_max_ch_3	1	0.23	0.20	0.01	0.05	0.18	0.35	0.97	▇▃▂▁▁
inten_cooc_max_ch_4	1	0.25	0.18	0.01	0.11	0.21	0.34	0.94	▇▆▂▁▁
kurt_inten_ch_1	1	0.82	3.62	-1.40	-0.49	-0.01	0.88	97.42	▇▁▁▁▁
kurt_inten_ch_3	1	3.34	7.05	-1.35	0.10	1.42	3.99	99.98	▇▁▁▁▁
kurt_inten_ch_4	1	0.97	5.04	-1.56	-0.83	-0.31	0.75	82.72	▇▁▁▁▁
length_ch_1	1	30.36	11.98	14.90	22.13	27.32	34.84	102.96	▇▃▁▁▁
neighbor_avg_dist_ch_1	1	231.07	42.87	146.08	196.98	226.99	261.07	375.79	▅▇▆▂▁
neighbor_min_dist_ch_1	1	29.69	11.50	10.08	22.55	27.64	34.08	126.99	▇▂▁▁▁
neighbor_var_dist_ch_1	1	105.14	20.84	56.89	87.91	107.22	121.61	159.30	▃▇▇▇▁
perim_ch_1	1	89.98	40.82	47.49	64.15	77.50	100.52	459.77	▇▁▁▁▁
shape_bfr_ch_1	1	0.60	0.10	0.19	0.53	0.60	0.67	0.89	▁▂▇▇▁
shape_lwr_ch_1	1	1.81	0.71	1.00	1.33	1.63	2.06	7.76	▇▂▁▁▁
shape_p_2_a_ch_1	1	2.05	0.95	1.00	1.38	1.78	2.38	8.10	▇▂▁▁▁
skew_inten_ch_1	1	0.69	0.72	-2.67	0.28	0.61	0.98	6.88	▁▇▂▁▁
skew_inten_ch_3	1	1.49	1.04	-1.11	0.82	1.30	1.92	9.67	▅▇▁▁▁
skew_inten_ch_4	1	0.93	0.89	-1.00	0.40	0.73	1.23	8.07	▇▆▁▁▁
spot_fiber_count_ch_3	1	1.88	1.62	0.00	1.00	2.00	3.00	16.00	▇▁▁▁▁
spot_fiber_count_ch_4	1	6.82	4.16	1.00	4.00	6.00	8.00	50.00	▇▁▁▁▁
total_inten_ch_1	1	37245.12	61836.39	2382.00	9499.50	18285.00	35716.50	741411.00	▇▁▁▁▁
total_inten_ch_2	1	52258.09	46496.51	1.00	14367.00	49220.00	72495.00	363311.00	▇▂▁▁▁
total_inten_ch_3	1	26759.65	27758.92	24.00	8776.00	18749.00	35277.00	313433.00	▇▁▁▁▁
total_inten_ch_4	1	40551.36	46312.20	96.00	9939.00	24839.00	55004.00	519602.00	▇▁▁▁▁
var_inten_ch_1	1	72.20	79.69	11.47	25.30	42.50	81.77	642.02	▇▁▁▁▁
var_inten_ch_3	1	98.55	96.91	0.87	36.70	69.12	123.84	757.02	▇▂▁▁▁
var_inten_ch_4	1	120.02	112.11	2.30	47.43	87.25	159.14	933.52	▇▂▁▁▁
width_ch_1	1	17.62	6.17	6.39	13.82	16.19	19.78	54.74	▇▆▁▁▁

data(cells, package = 'modeldata')
cells

# A tibble: 2,019 × 58
   case  class angle_ch_1 area_ch_1 avg_inten_ch_1 avg_inten_ch_2 avg_inten_ch_3
   <fct> <fct>      <dbl>     <int>          <dbl>          <dbl>          <dbl>
 1 Test  PS        143.         185           15.7           4.95           9.55
 2 Train PS        134.         819           31.9         207.            69.9 
 3 Train WS        107.         431           28.0         116.            63.9 
 4 Train PS         69.2        298           19.5         102.            28.2 
 5 Test  PS          2.89       285           24.3         112.            20.5 
 6 Test  WS         40.7        172          326.          654.           129.  
 7 Test  WS        174.         177          260.          596.           124.  
 8 Test  PS        180.         251           18.3           5.73          17.2 
 9 Test  WS         18.9        495           16.1          89.5           13.7 
10 Test  WS        153.         384           17.7          89.9           20.4 
# ℹ 2,009 more rows
# ℹ 51 more variables: avg_inten_ch_4 <dbl>, convex_hull_area_ratio_ch_1 <dbl>,
#   convex_hull_perim_ratio_ch_1 <dbl>, diff_inten_density_ch_1 <dbl>,
#   diff_inten_density_ch_3 <dbl>, diff_inten_density_ch_4 <dbl>,
#   entropy_inten_ch_1 <dbl>, entropy_inten_ch_3 <dbl>,
#   entropy_inten_ch_4 <dbl>, eq_circ_diam_ch_1 <dbl>,
#   eq_ellipse_lwr_ch_1 <dbl>, eq_ellipse_oblate_vol_ch_1 <dbl>, …

##2종의 이미지 분류, 각 갯수

cells %>% 
  count(class) %>% 
  mutate(prop = n/sum(n))

# A tibble: 2 × 3
  class     n  prop
  <fct> <int> <dbl>
1 PS     1300 0.644
2 WS      719 0.356

##train/test 만들기(층화 추출)

set.seed(123) # 차후 동일한 결과확인을 위해 Seed를 정해줍니다.

cells_split <- 
  initial_split(cells %>% select(-case),
                strata = class) # 층화 추출법

cell_train <- training(cells_split)
cell_test <- testing(cells_split)

set.seed(123) # 차후 동일한 결과확인을 위해 Seed를 정해줍니다.

cells_split <- 
  initial_split(cells %>% select(-case),
                strata = class) # 층화 추출법

cell_train <- training(cells_split)
cell_test <- testing(cells_split)


cell_train %>%
  count(class) %>%
  mutate(prop = n / sum(n))

# A tibble: 2 × 3
  class     n  prop
  <fct> <int> <dbl>
1 PS      975 0.644
2 WS      539 0.356

# test
cell_test %>%
  count(class) %>%
  mutate(prop = n / sum(n))

# A tibble: 2 × 3
  class     n  prop
  <fct> <int> <dbl>
1 PS      325 0.644
2 WS      180 0.356

모델링(랜덤포레스트)

rf_mod <- 
  rand_forest(mtry = 1000) %>% 
  set_engine('ranger') %>% 
  set_mode('classification')

##Fitting

rf_fit <-
  rf_mod %>%
  fit(class ~ ., data = cell_train)

Warning: 1000 columns were requested but there were 56 predictors in the data.
56 will be used.

##예측

library(yardstick) #confusion matrix 한번에 보기

rf_testing_pred <- 
  predict(rf_fit, cell_test) %>% 
  bind_cols(predict(rf_fit, cell_test, type = 'prob')) %>% 
  bind_cols(cell_test %>% select(class))
rf_testing_pred

# A tibble: 505 × 4
   .pred_class .pred_PS .pred_WS class
   <fct>          <dbl>    <dbl> <fct>
 1 PS            0.873   0.127   PS   
 2 PS            0.983   0.0165  PS   
 3 WS            0.0972  0.903   WS   
 4 PS            0.975   0.0250  WS   
 5 PS            0.661   0.339   PS   
 6 WS            0.278   0.722   WS   
 7 PS            0.976   0.0242  PS   
 8 WS            0.0221  0.978   WS   
 9 PS            0.992   0.00758 PS   
10 WS            0.279   0.721   WS   
# ℹ 495 more rows

##성능평가

rf_testing_pred %>% 
  roc_auc(truth = class, .pred_PS)

# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 roc_auc binary         0.888

rf_testing_pred %>% 
  accuracy(truth = class, .pred_class)

# A tibble: 1 × 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy binary         0.820

##roc-curve 그리기

rf_testing_pred  %>%
  ggplot() +
  aes(x = class, y = .pred_PS, fill = class) +
  geom_violin(bw = .05) +

  theme_minimal() +
  scale_fill_brewer(type = "qual") +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))

rf_testing_pred |> 
  roc_curve(class, .pred_PS) |> 
  autoplot()

교차검증

그림에서처럼 Train Data를 쪼개서 각각을 다시 Trainin Set과 Testing Set으로 나누는 작업을 합니다. 예를 들어 현재 1514 개의 Traing Data의 Row 개수를 10개의 덩어리로 나눈다고 가정하겠습니다. 그럴 경우 90%의 데이터는 Analysis 데이터로 두고, 10%의 데이터를 Assessment 데이터로 두어서 모델을 적용하고 결과를 확인합니다. 결과를 낸 후 다른 그룹으로 90%와 10%를 각각 묶어 테스트합니다. 이렇게 10번의 과정을 겨치며 테스트 하는 것을 교차검증이라고 합니다.

###train 데이티를 10개의 묶음으로 랜덤하게 묶되, 그 아나에 train/test를 다시 나눈다.

set.seed(345)

folds <- vfold_cv(cell_train, v = 10)
folds

#  10-fold cross-validation 
# A tibble: 10 × 2
   splits             id    
   <list>             <chr> 
 1 <split [1362/152]> Fold01
 2 <split [1362/152]> Fold02
 3 <split [1362/152]> Fold03
 4 <split [1362/152]> Fold04
 5 <split [1363/151]> Fold05
 6 <split [1363/151]> Fold06
 7 <split [1363/151]> Fold07
 8 <split [1363/151]> Fold08
 9 <split [1363/151]> Fold09
10 <split [1363/151]> Fold10

###v-fold 10개의 데이터셋에 랜덤포레스트 모델링을 하고 핏팅을 한다.

rf_wf <-  
  workflow() %>% 
  add_model(rf_mod) %>% 
  add_formula(class ~ .)


set.seed(456)

rf_fit_rs <- 
  rf_wf %>% 
  fit_resamples(folds)

Warning: package 'ranger' was built under R version 4.3.2

→ A | warning: 1000 columns were requested but there were 56 predictors in the data. 56 will be used.

There were issues with some computations   A: x1

There were issues with some computations   A: x2

There were issues with some computations   A: x3

There were issues with some computations   A: x4

There were issues with some computations   A: x5

There were issues with some computations   A: x6

There were issues with some computations   A: x7

There were issues with some computations   A: x8

There were issues with some computations   A: x9

There were issues with some computations   A: x10
There were issues with some computations   A: x10

###최종 성능 확인하기 (10개의 평균 accuracy, roc_auc)

collect_metrics(rf_fit_rs)

# A tibble: 2 × 6
  .metric  .estimator  mean     n std_err .config             
  <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
1 accuracy binary     0.817    10 0.00908 Preprocessor1_Model1
2 roc_auc  binary     0.891    10 0.00692 Preprocessor1_Model1

###의사결정나무 모델링할때 하이퍼파라미터 튜닝하기(트리깊이)

tune_spec <- 
  decision_tree(
    cost_complexity = tune(),
    tree_depth = tune()
  ) %>% 
  set_engine("rpart") %>% 
  set_mode("classification")

###cost_complexity(5종류)와 트리깊이(5종류) 튜닝 - 25종류

tree_grid <- grid_regular(cost_complexity(),
                          tree_depth(),
                          levels = 5)

tree_grid

# A tibble: 25 × 2
   cost_complexity tree_depth
             <dbl>      <int>
 1    0.0000000001          1
 2    0.0000000178          1
 3    0.00000316            1
 4    0.000562              1
 5    0.1                   1
 6    0.0000000001          4
 7    0.0000000178          4
 8    0.00000316            4
 9    0.000562              4
10    0.1                   4
# ℹ 15 more rows

tree_grid %>% 
  count(tree_depth)

# A tibble: 5 × 2
  tree_depth     n
       <int> <int>
1          1     5
2          4     5
3          8     5
4         11     5
5         15     5

###25개의 트리모델을 이용하여 교차검증하기

set.seed(345)
cell_folds <- vfold_cv(cell_train, v = 10)


tree_wf <- workflow() %>% 
  add_model(tune_spec) %>% 
  add_formula(class ~ .)

tree_res <- 
  tree_wf %>% 
  tune_grid(
    resamples = cell_folds,
    grid = tree_grid
  )

tree_res

# Tuning results
# 10-fold cross-validation 
# A tibble: 10 × 4
   splits             id     .metrics          .notes          
   <list>             <chr>  <list>            <list>          
 1 <split [1362/152]> Fold01 <tibble [50 × 6]> <tibble [0 × 3]>
 2 <split [1362/152]> Fold02 <tibble [50 × 6]> <tibble [0 × 3]>
 3 <split [1362/152]> Fold03 <tibble [50 × 6]> <tibble [0 × 3]>
 4 <split [1362/152]> Fold04 <tibble [50 × 6]> <tibble [0 × 3]>
 5 <split [1363/151]> Fold05 <tibble [50 × 6]> <tibble [0 × 3]>
 6 <split [1363/151]> Fold06 <tibble [50 × 6]> <tibble [0 × 3]>
 7 <split [1363/151]> Fold07 <tibble [50 × 6]> <tibble [0 × 3]>
 8 <split [1363/151]> Fold08 <tibble [50 × 6]> <tibble [0 × 3]>
 9 <split [1363/151]> Fold09 <tibble [50 × 6]> <tibble [0 × 3]>
10 <split [1363/151]> Fold10 <tibble [50 × 6]> <tibble [0 × 3]>

###교차검증 결과 보기

tree_res %>% 
  collect_metrics()

# A tibble: 50 × 8
   cost_complexity tree_depth .metric  .estimator  mean     n std_err .config   
             <dbl>      <int> <chr>    <chr>      <dbl> <int>   <dbl> <chr>     
 1    0.0000000001          1 accuracy binary     0.733    10 0.00997 Preproces…
 2    0.0000000001          1 roc_auc  binary     0.778    10 0.00944 Preproces…
 3    0.0000000178          1 accuracy binary     0.733    10 0.00997 Preproces…
 4    0.0000000178          1 roc_auc  binary     0.778    10 0.00944 Preproces…
 5    0.00000316            1 accuracy binary     0.733    10 0.00997 Preproces…
 6    0.00000316            1 roc_auc  binary     0.778    10 0.00944 Preproces…
 7    0.000562              1 accuracy binary     0.733    10 0.00997 Preproces…
 8    0.000562              1 roc_auc  binary     0.778    10 0.00944 Preproces…
 9    0.1                   1 accuracy binary     0.733    10 0.00997 Preproces…
10    0.1                   1 roc_auc  binary     0.778    10 0.00944 Preproces…
# ℹ 40 more rows

###복잡도와 트리깊이 5종류에 대해 각각의 accuracy와 roc_auc 보기

tree_res %>% 
  collect_metrics() %>%
  # tree depth를 factor로 치환
  mutate(tree_depth = factor(tree_depth)) %>% 
  ggplot(aes(cost_complexity, mean, color = tree_depth))+
  geom_line(alpha = 0.6)+
  geom_point(size = 2)+
  # metric을 차원으로 그래프를 두 개로 나눔
  facet_wrap( ~ .metric, scales = 'free', nrow = 2)+
  # log scale을 적용
  scale_x_log10(labels = scales::label_number()) # Decimal Format으로 강제함

###가장 좋은 모델 선택하기 (show_best()), 기준은 roc_auc

 tree_res %>%
  show_best("roc_auc")

# A tibble: 5 × 8
  cost_complexity tree_depth .metric .estimator  mean     n std_err .config     
            <dbl>      <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>       
1    0.0000000001          8 roc_auc binary     0.850    10 0.00729 Preprocesso…
2    0.0000000178          8 roc_auc binary     0.850    10 0.00729 Preprocesso…
3    0.00000316            8 roc_auc binary     0.850    10 0.00729 Preprocesso…
4    0.000562              8 roc_auc binary     0.850    10 0.00729 Preprocesso…
5    0.000562             11 roc_auc binary     0.845    10 0.00706 Preprocesso…

best_tree <- tree_res %>%
  select_best("roc_auc")

best_tree

# A tibble: 1 × 3
  cost_complexity tree_depth .config              
            <dbl>      <int> <chr>                
1    0.0000000001          8 Preprocessor1_Model11

###best 모델로 모델링하기

final_wf <- 
  tree_wf %>% 
  finalize_workflow(best_tree)

final_wf

══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Formula
Model: decision_tree()

── Preprocessor ────────────────────────────────────────────────────────────────
class ~ .

── Model ───────────────────────────────────────────────────────────────────────
Decision Tree Model Specification (classification)

Main Arguments:
  cost_complexity = 1e-10
  tree_depth = 8

Computational engine: rpart

###훈련데이터를 넣어 best 모델링하기

final_tree <- 
  final_wf %>% 
  fit(data = cell_train)

final_wf

══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Formula
Model: decision_tree()

── Preprocessor ────────────────────────────────────────────────────────────────
class ~ .

── Model ───────────────────────────────────────────────────────────────────────
Decision Tree Model Specification (classification)

Main Arguments:
  cost_complexity = 1e-10
  tree_depth = 8

Computational engine: rpart

###변수별 중요도 보기

library(vip)

Warning: package 'vip' was built under R version 4.3.2


Attaching package: 'vip'

The following object is masked from 'package:utils':

    vi

final_tree %>% 
  pull_workflow_fit() %>% 
  vip()

Warning: `pull_workflow_fit()` was deprecated in workflows 0.2.3.
ℹ Please use `extract_fit_parsnip()` instead.

###best 모델로 test 데이터 넣고 예측하기

final_fit <- 
  final_wf %>% 
  # recipe로 나누었던 cell_split에 대하여 Last_fit함수 적용
  last_fit(cells_split)

final_fit %>% 
  collect_metrics()

# A tibble: 2 × 4
  .metric  .estimator .estimate .config             
  <chr>    <chr>          <dbl> <chr>               
1 accuracy binary         0.772 Preprocessor1_Model1
2 roc_auc  binary         0.847 Preprocessor1_Model1

###roc-curve보기

final_fit %>% 
  # 예측결과 도출
  collect_predictions() %>% 
  # ROC Cure 도출
  roc_curve(class, .pred_PS) %>% 
  # Plot 그리기
  autoplot()

##노모그램 그리기
관련논문

if(!require('rms')) {
  install.packages('rms')
  library('rms')
}

Loading required package: rms

Warning: package 'rms' was built under R version 4.3.2

Loading required package: Hmisc

Warning: package 'Hmisc' was built under R version 4.3.2


Attaching package: 'Hmisc'

The following object is masked from 'package:parsnip':

    translate

The following objects are masked from 'package:dplyr':

    src, summarize

The following objects are masked from 'package:base':

    format.pval, units

Warning in .recacheSubclasses(def@className, def, env): undefined subclass
"ndiMatrix" of class "replValueSp"; definition not updated

n <- 1000    # define sample size
set.seed(17) # so can reproduce the results
d <- data.frame(age = rnorm(n, 50, 10),
                blood.pressure = rnorm(n, 120, 15),
                cholesterol = rnorm(n, 200, 25),
                sex = factor(sample(c('female','male'), n,TRUE)))

# Specify population model for log odds that Y=1
# Simulate binary y to have Prob(y=1) = 1/[1+exp(-L)]
d <- upData(d,
            L = .4*(sex=='male') + .045*(age-50) +
              (log(cholesterol - 10)-5.2)*(-2*(sex=='female') + 2*(sex=='male')),
            y = ifelse(runif(n) < plogis(L), 1, 0))

Input object size:   29616 bytes;    4 variables     1000 observations
Added variable      L
Added variable      y
New object size:    41856 bytes;    6 variables 1000 observations

ddist <- datadist(d); options(datadist='ddist')


f <- lrm(y ~ lsp(age,50) + sex * rcs(cholesterol, 4) + blood.pressure,
         data=d)
nom <- nomogram(f, fun=function(x)1/(1+exp(-x)),  # or fun=plogis
                fun.at=c(.001,.01,.05,seq(.1,.9,by=.1),.95,.99,.999),
                funlabel="Risk of Death")
#Instead of fun.at, could have specified fun.lp.at=logit of
#sequence above - faster and slightly more accurate
plot(nom, xfrac=.45)

print(nom)

Points per unit of linear predictor: 25.39362 
Linear predictor units per point   : 0.03937996 


 age Points
 10   0    
 20  14    
 30  27    
 40  41    
 50  54    
 60  61    
 70  67    
 80  74    
 90  80    


 cholesterol (sex=female) Points
 120                      30    
 140                      25    
 160                      21    
 180                      17    
 200                      19    
 220                      20    
 240                      14    
 260                       7    
 280                       0    


 cholesterol (sex=male) Points
 120                     12   
 140                     18   
 160                     25   
 180                     30   
 200                     30   
 220                     35   
 240                     55   
 260                     77   
 280                    100   


 blood.pressure Points
  70            2     
  80            2     
  90            1     
 100            1     
 110            1     
 120            1     
 130            1     
 140            1     
 150            0     
 160            0     
 170            0     


 Total Points Risk of Death
           37           0.2
           51           0.3
           62           0.4
           72           0.5
           83           0.6
           94           0.7
          108           0.8
          128           0.9

nom <- nomogram(f, age=seq(10,90,by=10))
plot(nom, xfrac=.45)

g <- lrm(y ~ sex + rcs(age, 3) * rcs(cholesterol, 3), data=d)
nom <- nomogram(g, interact=list(age=c(20,40,60)), 
                conf.int=c(.7,.9,.95))
plot(nom, col.conf=c(1,.5,.2), naxes=7)

require(survival)

Loading required package: survival

Warning: package 'survival' was built under R version 4.3.2

w <- upData(d,
            cens = 15 * runif(n),
            h = .02 * exp(.04 * (age - 50) + .8 * (sex == 'Female')),
            d.time = -log(runif(n)) / h,
            death = ifelse(d.time <= cens, 1, 0),
            d.time = pmin(d.time, cens))

Input object size:   41856 bytes;    6 variables     1000 observations
Added variable      cens
Added variable      h
Added variable      d.time
Added variable      death
Modified variable   d.time
New object size:    70432 bytes;    10 variables    1000 observations

f <- psm(Surv(d.time,death) ~ sex * age, data=w, dist='lognormal')
med  <- Quantile(f)
surv <- Survival(f)  # This would also work if f was from cph
plot(nomogram(f, fun=function(x) med(lp=x), funlabel="Median Survival Time"))

nom <- nomogram(f, fun=list(function(x) surv(3, x),
                            function(x) surv(6, x)),
                funlabel=c("3-Month Survival Probability", 
                           "6-month Survival Probability"))
plot(nom, xfrac=.7)

if (FALSE) {
  nom <- nomogram(fit.with.categorical.predictors, abbrev=TRUE, minlength=1)
  nom$x1$points   # print points assigned to each level of x1 for its axis
  #Add legend for abbreviations for category levels
  abb <- attr(nom, 'info')$abbrev$treatment
  legend(locator(1), abb$full, pch=paste(abb$abbrev,collapse=''), 
         ncol=2, bty='n')  # this only works for 1-letter abbreviations
  #Or use the legend.nomabbrev function:
  legend.nomabbrev(nom, 'treatment', locator(1), ncol=2, bty='n')
}


#Make a nomogram with axes predicting probabilities Y>=j for all j=1-3
#in an ordinal logistic model, where Y=0,1,2,3
w <- upData(w, Y = ifelse(y==0, 0, sample(1:3, length(y), TRUE)))

Input object size:   70432 bytes;    10 variables    1000 observations
Added variable      Y
New object size:    74536 bytes;    11 variables    1000 observations

g <- lrm(Y ~ age+rcs(cholesterol,4) * sex, data=w)
fun2 <- function(x) plogis(x-g$coef[1]+g$coef[2])
fun3 <- function(x) plogis(x-g$coef[1]+g$coef[3])
f <- Newlabels(g, c(age='Age in Years'))  
#see Design.Misc, which also has Newlevels to change 
#labels for levels of categorical variables
g <- nomogram(f, fun=list('Prob Y>=1'=plogis, 'Prob Y>=2'=fun2, 
                          'Prob Y=3'=fun3), 
              fun.at=c(.01,.05,seq(.1,.9,by=.1),.95,.99))
plot(g, lmgp=.2, cex.axis=.6)

options(datadist=NULL)

library(stargazer)
## Latex 형태 테이블

stargazer(mtcars, type= "text", title= "Summary Statistics", out= "mtcars.text")


Summary Statistics
============================================
Statistic N   Mean   St. Dev.  Min     Max  
--------------------------------------------
mpg       32 20.091   6.027   10.400 33.900 
cyl       32  6.188   1.786     4       8   
disp      32 230.722 123.939  71.100 472.000
hp        32 146.688  68.563    52     335  
drat      32  3.597   0.535   2.760   4.930 
wt        32  3.217   0.978   1.513   5.424 
qsec      32 17.849   1.787   14.500 22.900 
vs        32  0.438   0.504     0       1   
am        32  0.406   0.499     0       1   
gear      32  3.688   0.738     3       5   
carb      32  2.812   1.615     1       8   
--------------------------------------------

## apply a list of functions to a list or vector
f <- function(X, FUN, ...) {
  fn <- as.character(match.call()$FUN)[-1]
  out <- sapply(FUN, mapply, X, ...)
  setNames(as.data.frame(out), fn)
}

(out <- round(f(attitude, list(mean, sd, median, min, max)), 2))

            mean    sd median min max
rating     64.63 12.17   65.5  40  85
complaints 66.60 13.31   65.0  37  90
privileges 53.13 12.24   51.5  30  83
learning   56.37 11.74   56.5  34  75
raises     64.63 10.40   63.5  43  88
critical   74.77  9.89   77.5  49  92
advance    42.93 10.29   41.0  25  72

library('htmlTable')

Warning: package 'htmlTable' was built under R version 4.3.2

htmlTable(out,  cgroup = 'Statistic', n.cgroup = 5, caption = 'Table 1: default')

Table 1: default
	Statistic
	mean	sd	median	min	max
rating	64.63	12.17	65.5	40	85
complaints	66.6	13.31	65	37	90
privileges	53.13	12.24	51.5	30	83
learning	56.37	11.74	56.5	34	75
raises	64.63	10.4	63.5	43	88
critical	74.77	9.89	77.5	49	92
advance	42.93	10.29	41	25	72

htmlTable(out, cgroup = 'Statistic', n.cgroup = 5, caption = 'Table 1: padding',
          ## padding to cells: top side bottom
          css.cell = 'padding: 0px 10px 0px;')

Table 1: padding
	Statistic
	mean	sd	median	min	max
rating	64.63	12.17	65.5	40	85
complaints	66.6	13.31	65	37	90
privileges	53.13	12.24	51.5	30	83
learning	56.37	11.74	56.5	34	75
raises	64.63	10.4	63.5	43	88
critical	74.77	9.89	77.5	49	92
advance	42.93	10.29	41	25	72

문건웅 교수 패키지

테이블

#install.packages("moonBook")
library(moonBook)

Warning: package 'moonBook' was built under R version 4.3.2


Attaching package: 'moonBook'

The following object is masked from 'package:scales':

    comma

data(acs)

mytable(sex~age+Dx, data=acs)


         Descriptive Statistics by 'sex'         
—————————————————————————————————————————————————— 
                       Female       Male       p  
                       (N=287)     (N=570)  
—————————————————————————————————————————————————— 
 age                 68.7 ± 10.7 60.6 ± 11.2 0.000
 Dx                                          0.012
   - NSTEMI          50 (17.4%)  103 (18.1%)      
   - STEMI           84 (29.3%)  220 (38.6%)      
   - Unstable Angina 153 (53.3%) 247 (43.3%)      
——————————————————————————————————————————————————

테이블 만들기

str(mtcars)

'data.frame':   32 obs. of  11 variables:
 $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num  160 160 108 258 360 ...
 $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num  16.5 17 18.6 19.4 17 ...
 $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

#mytable 함수에서는 고유값이 5개 미만인 경우 숫자로 입력되어 있더라도 연속형 변수가 아닌 범주형 변수로 취급합니다.
mytable(am~., data=mtcars)


    Descriptive Statistics by 'am'    
——————————————————————————————————————— 
             0            1         p  
          (N=19)        (N=13)   
——————————————————————————————————————— 
 mpg    17.1 ±  3.8  24.4 ±  6.2  0.000
 cyl                              0.013
   - 4   3 (15.8%)    8 (61.5%)        
   - 6   4 (21.1%)    3 (23.1%)        
   - 8  12 (63.2%)    2 (15.4%)        
 disp  290.4 ± 110.2 143.5 ± 87.2 0.000
 hp    160.3 ± 53.9  126.8 ± 84.1 0.180
 drat    3.3 ±  0.4   4.0 ±  0.4  0.000
 wt      3.8 ±  0.8   2.4 ±  0.6  0.000
 qsec   18.2 ±  1.8  17.4 ±  1.8  0.206
 vs                               0.556
   - 0  12 (63.2%)    6 (46.2%)        
   - 1   7 (36.8%)    7 (53.8%)        
 gear                             0.000
   - 3  15 (78.9%)     0 ( 0.0%)       
   - 4   4 (21.1%)    8 (61.5%)        
   - 5   0 ( 0.0%)    5 (38.5%)        
 carb    2.7 ±  1.1   2.9 ±  2.2  0.781
———————————————————————————————————————

#carb  범주형 변수로 취급하고 싶다면 [max.ylev]의 인수를 바꿔주면 됩니다.
mytable(am~carb, data=mtcars,max.ylev=6)


  Descriptive Statistics by 'am' 
—————————————————————————————————— 
           0          1        p  
         (N=19)     (N=13)  
—————————————————————————————————— 
 carb                        0.284
   - 1 3 (15.8%)  4 (30.8%)       
   - 2 6 (31.6%)  4 (30.8%)       
   - 3 3 (15.8%)   0 ( 0.0%)      
   - 4 7 (36.8%)  3 (23.1%)       
   - 6  0 ( 0.0%) 1 ( 7.7%)       
   - 8  0 ( 0.0%) 1 ( 7.7%)       
——————————————————————————————————

###1.4 여러개의 표를 하나로 합치기
# 전체 환자를 성별에 따라 남, 여로 나누고 다시 당뇨병 유무에 따라 표를 나누어 만든 후 표를 합치고 싶으면 다음과 같이 합니다.
mytable(sex+DM~., data=acs)


                 Descriptive Statistics stratified by 'sex' and 'DM'                
————————————————————————————————————————————————————————————————————————————————————— 
                                    Male                             Female             
                     ———————————————————————————————— ——————————————————————————————— 
                          No           Yes        p        No          Yes        p  
                       (N=380)       (N=190)           (N=173)      (N=114)        
————————————————————————————————————————————————————————————————————————————————————— 
 age                 60.9 ± 11.5   60.1 ± 10.6  0.472 69.3 ± 11.4  67.8 ±  9.7  0.257
 cardiogenicShock                               0.685                           0.296
   - No              355 (93.4%)   175 (92.1%)        168 (97.1%)  107 (93.9%)       
   - Yes              25 ( 6.6%)   15 ( 7.9%)          5 ( 2.9%)    7 ( 6.1%)        
 entry                                          0.552                           0.665
   - Femoral         125 (32.9%)   68 (35.8%)          74 (42.8%)   45 (39.5%)       
   - Radial          255 (67.1%)   122 (64.2%)         99 (57.2%)   69 (60.5%)       
 Dx                                             0.219                           0.240
   - NSTEMI           71 (18.7%)   32 (16.8%)          25 (14.5%)   25 (21.9%)       
   - STEMI           154 (40.5%)   66 (34.7%)          54 (31.2%)   30 (26.3%)       
   - Unstable Angina 155 (40.8%)   92 (48.4%)          94 (54.3%)   59 (51.8%)       
 EF                  56.5 ±  8.3   53.9 ± 11.0  0.007 56.0 ± 10.1  56.6 ± 10.0  0.655
 height              168.1 ±  5.8 167.5 ±  6.7  0.386 153.9 ±  6.5 153.6 ±  5.8 0.707
 weight              68.1 ± 10.4   69.8 ± 10.2  0.070 56.5 ±  8.7  58.4 ± 10.0  0.106
 BMI                 24.0 ±  3.1   24.9 ±  3.5  0.005 23.8 ±  3.2  24.8 ±  4.0  0.046
 obesity                                        0.027                           0.359
   - No              261 (68.7%)   112 (58.9%)        121 (69.9%)   73 (64.0%)       
   - Yes             119 (31.3%)   78 (41.1%)          52 (30.1%)   41 (36.0%)       
 TC                  184.1 ± 46.7 181.8 ± 44.5  0.572 186.0 ± 43.1 193.3 ± 60.8 0.274
 LDLC                117.9 ± 41.8 112.1 ± 39.4  0.115 116.3 ± 35.2 119.8 ± 48.6 0.519
 HDLC                38.4 ± 11.4   36.8 ±  9.6  0.083 39.2 ± 10.9  38.8 ± 12.2  0.821
 TG                  115.2 ± 72.2 153.4 ± 130.7 0.000 114.2 ± 82.4 128.4 ± 65.5 0.112
 HBP                                            0.000                           0.356
   - No              205 (53.9%)   68 (35.8%)          54 (31.2%)   29 (25.4%)       
   - Yes             175 (46.1%)   122 (64.2%)        119 (68.8%)   85 (74.6%)       
 smoking                                        0.386                           0.093
   - Ex-smoker       101 (26.6%)   54 (28.4%)          34 (19.7%)   15 (13.2%)       
   - Never            77 (20.3%)   46 (24.2%)         118 (68.2%)   91 (79.8%)       
   - Smoker          202 (53.2%)   90 (47.4%)          21 (12.1%)   8 ( 7.0%)        
—————————————————————————————————————————————————————————————————————————————————————

#1.5 csv 파일로 저장하기
mycsv(mytable(sex+DM~., data=acs),file='test1.csv')
mycsv(mytable(sex+DM~age+Dx, data=acs),file='test2.csv')


#1.6 html 형식으로 표를 출력하기
# myhtml(mytable(sex+DM~., data=acs))
# myhtml(mytable(sex+DM~age+Dx, data=acs))

히스토그램

require(UsingR)

Loading required package: UsingR

Warning: package 'UsingR' was built under R version 4.3.2

Loading required package: MASS


Attaching package: 'MASS'

The following object is masked from 'package:dplyr':

    select

Loading required package: HistData

Warning: package 'HistData' was built under R version 4.3.2


Attaching package: 'UsingR'

The following object is masked from 'package:survival':

    cancer

data(galton)
str(galton)

'data.frame':   928 obs. of  2 variables:
 $ child : num  61.7 61.7 61.7 61.7 61.7 62.2 62.2 62.2 62.2 62.2 ...
 $ parent: num  70.5 68.5 65.5 64.5 64 67.5 67.5 67.5 66.5 66.5 ...

par(mfrow=c(1,2))
hist(galton$child, col="blue", breaks = 100)
hist(galton$parent, col="blue", breaks = 100)

##상관관계

par(mfrow=c(1,1)) # Default
cor.test(galton$child, galton$parent)


    Pearson's product-moment correlation

data:  galton$child and galton$parent
t = 15.711, df = 926, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.4064067 0.5081153
sample estimates:
      cor 
0.4587624

#부모와 아이 키 상관표
xtabs(~child+parent, data=galton)

      parent
child  64 64.5 65.5 66.5 67.5 68.5 69.5 70.5 71.5 72.5 73
  61.7  1    1    1    0    0    1    0    1    0    0  0
  62.2  0    1    0    3    3    0    0    0    0    0  0
  63.2  2    4    9    3    5    7    1    1    0    0  0
  64.2  4    4    5    5   14   11   16    0    0    0  0
  65.2  1    1    7    2   15   16    4    1    1    0  0
  66.2  2    5   11   17   36   25   17    1    3    0  0
  67.2  2    5   11   17   38   31   27    3    4    0  0
  68.2  1    0    7   14   28   34   20   12    3    1  0
  69.2  1    2    7   13   38   48   33   18    5    2  0
  70.2  0    0    5    4   19   21   25   14   10    1  0
  71.2  0    0    2    0   11   18   20    7    4    2  0
  72.2  0    0    1    0    4    4   11    4    9    7  1
  73.2  0    0    0    0    0    3    4    3    2    2  3
  73.7  0    0    0    0    0    0    5    3    2    4  0

##회귀분석

out = lm(child~parent, data=galton)
summary(out)


Call:
lm(formula = child ~ parent, data = galton)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.8050 -1.3661  0.0487  1.6339  5.9264 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 23.94153    2.81088   8.517   <2e-16 ***
parent       0.64629    0.04114  15.711   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.239 on 926 degrees of freedom
Multiple R-squared:  0.2105,    Adjusted R-squared:  0.2096 
F-statistic: 246.8 on 1 and 926 DF,  p-value: < 2.2e-16

plot(child~parent, data=galton)
abline(out,col='red')

## 단순회귀 분석
data(women)
str(women)

'data.frame':   15 obs. of  2 variables:
 $ height: num  58 59 60 61 62 63 64 65 66 67 ...
 $ weight: num  115 117 120 123 126 129 132 135 139 142 ...

fit = lm(weight~height, data=women)
summary(fit)


Call:
lm(formula = weight ~ height, data = women)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.7333 -1.1333 -0.3833  0.7417  3.1167 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -87.51667    5.93694  -14.74 1.71e-09 ***
height        3.45000    0.09114   37.85 1.09e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.525 on 13 degrees of freedom
Multiple R-squared:  0.991, Adjusted R-squared:  0.9903 
F-statistic:  1433 on 1 and 13 DF,  p-value: 1.091e-14

plot(weight~height, data=women)
abline(fit, col='blue')

fit2 = lm(weight~height+I(height^2), data=women)
summary(fit2)


Call:
lm(formula = weight ~ height + I(height^2), data = women)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.50941 -0.29611 -0.00941  0.28615  0.59706 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 261.87818   25.19677  10.393 2.36e-07 ***
height       -7.34832    0.77769  -9.449 6.58e-07 ***
I(height^2)   0.08306    0.00598  13.891 9.32e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3841 on 12 degrees of freedom
Multiple R-squared:  0.9995,    Adjusted R-squared:  0.9994 
F-statistic: 1.139e+04 on 2 and 12 DF,  p-value: < 2.2e-16

plot(weight~height,data=women)
lines(women$height, fitted(fit2), col="red")

scatterplot

require(car)

Loading required package: car

Loading required package: carData


Attaching package: 'car'

The following objects are masked from 'package:rms':

    Predict, vif

The following object is masked from 'package:dplyr':

    recode

The following object is masked from 'package:purrr':

    some

scatterplot(weight~height, data=women)

scatterplot(weight~height, data=women, pch=19, # pch=점을 속이 차있는 circle
            spread=FALSE, smoother.args=list(lty=2),
            main="Women Age 30~39",
            xlab="Height(inches)", ylab="Weight(lbs.)")

Warning in plot.window(...): "spread" is not a graphical parameter

Warning in plot.window(...): "smoother.args" is not a graphical parameter

Warning in plot.xy(xy, type, ...): "spread" is not a graphical parameter

Warning in plot.xy(xy, type, ...): "smoother.args" is not a graphical parameter

Warning in axis(side = side, at = at, labels = labels, ...): "spread" is not a
graphical parameter

Warning in axis(side = side, at = at, labels = labels, ...): "smoother.args" is
not a graphical parameter

Warning in axis(side = side, at = at, labels = labels, ...): "spread" is not a
graphical parameter

Warning in axis(side = side, at = at, labels = labels, ...): "smoother.args" is
not a graphical parameter

Warning in box(...): "spread" is not a graphical parameter

Warning in box(...): "smoother.args" is not a graphical parameter

Warning in title(...): "spread" is not a graphical parameter

Warning in title(...): "smoother.args" is not a graphical parameter

다중 회귀

require(MASS)
str(birthwt)

'data.frame':   189 obs. of  10 variables:
 $ low  : int  0 0 0 0 0 0 0 0 0 0 ...
 $ age  : int  19 33 20 21 18 21 22 17 29 26 ...
 $ lwt  : int  182 155 105 108 107 124 118 103 123 113 ...
 $ race : int  2 3 1 1 1 3 1 3 1 1 ...
 $ smoke: int  0 0 1 1 1 0 0 0 1 1 ...
 $ ptl  : int  0 0 0 0 0 0 0 0 0 0 ...
 $ ht   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ ui   : int  1 0 0 1 1 0 0 0 0 0 ...
 $ ftv  : int  0 3 1 2 0 0 1 1 1 0 ...
 $ bwt  : int  2523 2551 2557 2594 2600 2622 2637 2637 2663 2665 ...

# age(엄마의 나이), lwt(마지막 생리기간의 엄마의 몸무게), race(인종), smoke(임신 중 흡연상태), ptl(과거 조산 횟수), ht(고혈압의 기왕력), ui(uterine irritability 존재 여부)
out=lm(bwt~age+lwt+factor(race)+smoke+ptl+ht+ui,data=birthwt)
anova(out)

Analysis of Variance Table

Response: bwt
              Df   Sum Sq Mean Sq F value    Pr(>F)    
age            1   815483  815483  1.9380 0.1656022    
lwt            1  2967339 2967339  7.0519 0.0086284 ** 
factor(race)   2  4750632 2375316  5.6450 0.0041901 ** 
smoke          1  6291918 6291918 14.9529 0.0001538 ***
ptl            1   732501  732501  1.7408 0.1887130    
ht             1  2852764 2852764  6.7796 0.0099900 ** 
ui             1  5817995 5817995 13.8266 0.0002676 ***
Residuals    180 75741025  420783                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# coefplot 패키지를 이용하면 계수 플롯을 만들 수 있습니다.
# 신뢰구간이 0을 포함하지 않으면 통계적으로 유의하다고 볼 수 있습니다.
library(coefplot)

Warning: package 'coefplot' was built under R version 4.3.2

coefplot(out)

#분산분석표에서 age와 ptl에 유의하지 않으므로 두 변수를 제거한 모형을 만들고 두 변수의 중요도를 평가하기 위해 F-test를 실시합니다. F-test는 두 모형의 설명력을 비교하여 첫번째 모형에서 제거된 변수들의 기여도를 평가한다고 생각하면 됩니다. anova(작은모형, 큰모형)

out2=lm(bwt~lwt+factor(race)+smoke+ht+ui,data=birthwt)
anova(out2,out)

Analysis of Variance Table

Model 1: bwt ~ lwt + factor(race) + smoke + ht + ui
Model 2: bwt ~ age + lwt + factor(race) + smoke + ptl + ht + ui
  Res.Df      RSS Df Sum of Sq      F Pr(>F)
1    182 75937505                           
2    180 75741025  2    196480 0.2335  0.792

#F-test 결과 p-value가 0.792로 두 변수를 모두 제거할 수 있습니다.
anova(out2)

Analysis of Variance Table

Response: bwt
              Df   Sum Sq Mean Sq F value    Pr(>F)    
lwt            1  3448639 3448639  8.2654 0.0045226 ** 
factor(race)   2  5076610 2538305  6.0836 0.0027701 ** 
smoke          1  6281818 6281818 15.0557 0.0001458 ***
ht             1  2871867 2871867  6.8830 0.0094402 ** 
ui             1  6353218 6353218 15.2268 0.0001341 ***
Residuals    182 75937505  417239                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

summary(out2)


Call:
lm(formula = bwt ~ lwt + factor(race) + smoke + ht + ui, data = birthwt)

Residuals:
     Min       1Q   Median       3Q      Max 
-1842.14  -433.19    67.09   459.21  1631.03 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)   2837.264    243.676  11.644  < 2e-16 ***
lwt              4.242      1.675   2.532 0.012198 *  
factor(race)2 -475.058    145.603  -3.263 0.001318 ** 
factor(race)3 -348.150    112.361  -3.099 0.002254 ** 
smoke         -356.321    103.444  -3.445 0.000710 ***
ht            -585.193    199.644  -2.931 0.003810 ** 
ui            -525.524    134.675  -3.902 0.000134 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 645.9 on 182 degrees of freedom
Multiple R-squared:  0.2404,    Adjusted R-squared:  0.2154 
F-statistic:   9.6 on 6 and 182 DF,  p-value: 3.601e-09

상호작용이 있는 다중 회귀

data(mtcars)
fit=lm(mpg~hp*wt, data=mtcars)
summary(fit)


Call:
lm(formula = mpg ~ hp * wt, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.0632 -1.6491 -0.7362  1.4211  4.5513 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 49.80842    3.60516  13.816 5.01e-14 ***
hp          -0.12010    0.02470  -4.863 4.04e-05 ***
wt          -8.21662    1.26971  -6.471 5.20e-07 ***
hp:wt        0.02785    0.00742   3.753 0.000811 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.153 on 28 degrees of freedom
Multiple R-squared:  0.8848,    Adjusted R-squared:  0.8724 
F-statistic: 71.66 on 3 and 28 DF,  p-value: 2.981e-13

plot(mpg~hp, data=mtcars, main="Interaction of hp:wt")
curve(31.41-0.06*x,add=TRUE)
curve(23.37-0.03*x,add=TRUE,lty=2,col=2)
curve(15.33-0.003*x,add=TRUE,lty=3,col=3)
legend("topright",c("2.2","3.2","4.2"),title="wt",lty=1:3,col=1:3)

생존분석

참고

rm(list=ls())
library(mombf)

Warning: package 'mombf' was built under R version 4.3.2

Loading required package: mvtnorm

Warning: package 'mvtnorm' was built under R version 4.3.2

Loading required package: ncvreg

Warning: package 'ncvreg' was built under R version 4.3.2

Loading required package: mgcv

Warning: package 'mgcv' was built under R version 4.3.2

Loading required package: nlme

Warning: package 'nlme' was built under R version 4.3.2


Attaching package: 'nlme'

The following object is masked from 'package:HistData':

    Wheat

The following object is masked from 'package:dplyr':

    collapse

This is mgcv 1.9-0. For overview type 'help("mgcv-package")'.

library(survival)
library(xtable)


Attaching package: 'xtable'

The following objects are masked from 'package:Hmisc':

    label, label<-

library(ggplot2)


require(survival)
require(maxstat)

Loading required package: maxstat

Warning: package 'maxstat' was built under R version 4.3.2

colon <- read.csv("colon.csv")

fit = survfit(Surv(time,status==1)~1, data=colon)
fit

Call: survfit(formula = Surv(time, status == 1) ~ 1, data = colon)

        n events median 0.95LCL 0.95UCL
[1,] 1858    920   2351    2018    2910

#colon 데이터에서 time은 생존기간을 뜻하며 status가 1은 사망한 경우이고 0은 현재 살아있는 경우 입니다.
#Kaplan-Meier Curve 가운데 선이 KM curve이고 점선은 95% 신뢰구간입니다
plot(fit) # KM curve

# rx열은 환자를 치료군에 따라 세군으로 나눈 데이터 입니다. 치료에 따른 생존율의 차이를 보기 위해서는 survfit 함수를 사용할때 “~rx”를 넣어줍니다.


fit1 = survfit(Surv(time,status==1)~rx,data=colon)
plot(fit1,col=1:3, lty=1:3)
legend("topright", legend=unique(colon$rx), col=1:3, lty=1:3)

plot(fit1,col=1:3, lty=1:3, mark.time=TRUE)
legend("topright", legend=unique(colon$rx), col=1:3, lty=1:3)

#누적함수 cumulative hazard
plot(fit1,col=1:3, lty=1:3, fun="cumhaz", xlab="Days")
legend("topright", legend=unique(colon$rx), col=1:3, lty=1:3)

require(GGally)

Loading required package: GGally

Warning: package 'GGally' was built under R version 4.3.2

Registered S3 method overwritten by 'GGally':
  method from   
  +.gg   ggplot2

fit = survfit(Surv(time,status==1)~rx, data=colon)
legend("topright", legend=unique(colon$rx), col=1:3, lty=1:3)