En este informe construyo un modelo de aprendizaje automático para
predecir la variable classe
(forma en que
se realizó el ejercicio) usando los datos de acelerómetros del proyecto
Practical Machine Learning. Sigo un flujo reproducible:
descarga de datos, limpieza y preprocesamiento, partición
train/valid, validación cruzada, comparación de modelos,
estimación del error fuera de muestra y generación de predicciones para
los 20 casos de prueba.
pkgs <- c("tidyverse","caret","randomForest","gbm","doParallel","janitor","vip")
need <- pkgs[!pkgs %in% installed.packages()[, "Package"]]
if(length(need)) install.packages(need, repos = "https://cloud.r-project.org")
##
## The downloaded binary packages are in
## /var/folders/02/xk0cn3pd0rb_ldnrsdrh04j40000gn/T//RtmpJMTLFZ/downloaded_packages
invisible(lapply(pkgs, require, character.only = TRUE))
Descargo los conjuntos training y testing oficiales
y los leo tratando como NA a "NA"
, "#DIV/0!"
y
cadenas vacías.
train_url <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
test_url <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
train_raw <- read.csv(train_url, na.strings = c("NA", "#DIV/0!", ""))
test_raw <- read.csv(test_url, na.strings = c("NA", "#DIV/0!", ""))
dim(train_raw); dim(test_raw)
## [1] 19622 160
## [1] 20 160
El objetivo es quedarnos con predictores numéricos informativos
que existan tanto en train
como en
test
.
Elimino columnas no relacionadas al sensor (identificadores/tiempo/ventanas) y las de casi varianza-cero.
Elimino columnas con proporción de NA ≥ 95%.
Conservo solo variables presentes en ambos conjuntos.
Escalo/centro predictores numéricos con parámetros aprendidos en el set de entrenamiento.
# 1) Eliminar columnas no predictoras obvias
DROP <- c("X","user_name","raw_timestamp_part_1","raw_timestamp_part_2",
"cvtd_timestamp","new_window","num_window")
train <- train_raw %>% dplyr::select(-any_of(DROP))
test <- test_raw %>% dplyr::select(-any_of(DROP))
# 2) Columnas con demasiados NA (≥95%)
na_rate <- sapply(train, function(x) mean(is.na(x)))
keep_na <- names(na_rate)[na_rate < 0.95]
train <- train[, keep_na]
# Asegurar que 'classe' siga presente
stopifnot("classe" %in% names(train))
# 3) Mantener intersección de nombres entre train y test
common <- intersect(names(train), names(test))
# Nota: 'classe' no existe en test, la agregamos manualmente al conjunto de entrenamiento
train <- train %>% dplyr::select(any_of(c(common, "classe")))
test <- test %>% dplyr::select(any_of(common))
# Convertir outcome a factor
train$classe <- factor(train$classe)
# Near-zero variance (sobre train)
nzv <- nearZeroVar(train %>% dplyr::select(-classe), saveMetrics = TRUE)
train <- train[, c(rownames(nzv)[!nzv$nzv], "classe")]
test <- test[, rownames(nzv)[!nzv$nzv]]
# 4) Center/scale sobre columnas numéricas
num_cols <- names(train)[sapply(train, is.numeric)]
pp <- preProcess(train[, num_cols], method = c("center","scale"))
train[, num_cols] <- predict(pp, train[, num_cols])
test[, num_cols] <- predict(pp, test[, num_cols])
# Vista rápida
list(
n_predictors = ncol(train) - 1,
n_rows_train = nrow(train),
n_rows_test = nrow(test)
)
## $n_predictors
## [1] 52
##
## $n_rows_train
## [1] 19622
##
## $n_rows_test
## [1] 20
Uso una partición estratificada 70/30 para estimar el error fuera de muestra.
set.seed(20250904)
split <- createDataPartition(train$classe, p = 0.7, list = FALSE)
train_set <- train[split, ]
valid_set <- train[-split, ]
dim(train_set); dim(valid_set)
## [1] 13737 53
## [1] 5885 53
Entreno y comparo dos modelos fuertes para este tipo de datos: Random Forest (RF) y Gradient Boosting (GBM). Ambos con validación cruzada de 5 folds.
ctrl <- trainControl(method = "cv", number = 5, allowParallel = TRUE)
cl <- parallel::makePSOCKcluster(max(1, parallel::detectCores() - 1))
doParallel::registerDoParallel(cl)
set.seed(20250904)
rf_fit <- train(classe ~ ., data = train_set,
method = "rf", trControl = ctrl,
ntree = 500, importance = TRUE)
set.seed(20250904)
gbm_fit <- train(classe ~ ., data = train_set,
method = "gbm", trControl = ctrl,
verbose = FALSE)
stopCluster(cl); registerDoSEQ()
rf_fit; gbm_fit
## Random Forest
##
## 13737 samples
## 52 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 10988, 10990, 10988, 10991, 10991
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.9909002 0.9884882
## 27 0.9886435 0.9856334
## 52 0.9802706 0.9750393
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
## Stochastic Gradient Boosting
##
## 13737 samples
## 52 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 10988, 10990, 10988, 10991, 10991
## Resampling results across tuning parameters:
##
## interaction.depth n.trees Accuracy Kappa
## 1 50 0.7503081 0.6834491
## 1 100 0.8215775 0.7741082
## 1 150 0.8552087 0.8167779
## 2 50 0.8557174 0.8172028
## 2 100 0.9079850 0.8835427
## 2 150 0.9310602 0.9127612
## 3 50 0.8959017 0.8681665
## 3 100 0.9400890 0.9241912
## 3 150 0.9584322 0.9474103
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 150, interaction.depth =
## 3, shrinkage = 0.1 and n.minobsinnode = 10.
res <- resamples(list(RF = rf_fit, GBM = gbm_fit))
summary(res)
##
## Call:
## summary.resamples(object = res)
##
## Models: RF, GBM
## Number of resamples: 5
##
## Accuracy
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## RF 0.9894392 0.9901675 0.9909058 0.9909002 0.9919913 0.9919971 0
## GBM 0.9522942 0.9566642 0.9596217 0.9584322 0.9614405 0.9621405 0
##
## Kappa
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## RF 0.9866396 0.9875592 0.9884963 0.9884882 0.9898695 0.9898763 0
## GBM 0.9396448 0.9451826 0.9489032 0.9474103 0.9511961 0.9521250 0
lattice::bwplot(res)
rf_pred_v <- predict(rf_fit, valid_set)
cm_rf <- confusionMatrix(rf_pred_v, valid_set$classe)
cm_rf
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1674 5 0 0 0
## B 0 1132 6 0 0
## C 0 2 1020 13 0
## D 0 0 0 951 1
## E 0 0 0 0 1081
##
## Overall Statistics
##
## Accuracy : 0.9954
## 95% CI : (0.9933, 0.997)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9942
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 0.9939 0.9942 0.9865 0.9991
## Specificity 0.9988 0.9987 0.9969 0.9998 1.0000
## Pos Pred Value 0.9970 0.9947 0.9855 0.9989 1.0000
## Neg Pred Value 1.0000 0.9985 0.9988 0.9974 0.9998
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2845 0.1924 0.1733 0.1616 0.1837
## Detection Prevalence 0.2853 0.1934 0.1759 0.1618 0.1837
## Balanced Accuracy 0.9994 0.9963 0.9955 0.9932 0.9995
err_rf <- 1 - cm_rf$overall["Accuracy"]
gbm_pred_v <- predict(gbm_fit, valid_set)
cm_gbm <- confusionMatrix(gbm_pred_v, valid_set$classe)
cm_gbm
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1649 35 0 0 4
## B 18 1079 29 5 5
## C 4 25 980 40 4
## D 3 0 16 908 10
## E 0 0 1 11 1059
##
## Overall Statistics
##
## Accuracy : 0.9643
## 95% CI : (0.9593, 0.9689)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9548
##
## Mcnemar's Test P-Value : 2.662e-05
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9851 0.9473 0.9552 0.9419 0.9787
## Specificity 0.9907 0.9880 0.9850 0.9941 0.9975
## Pos Pred Value 0.9769 0.9498 0.9307 0.9691 0.9888
## Neg Pred Value 0.9940 0.9874 0.9905 0.9887 0.9952
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2802 0.1833 0.1665 0.1543 0.1799
## Detection Prevalence 0.2868 0.1930 0.1789 0.1592 0.1820
## Balanced Accuracy 0.9879 0.9677 0.9701 0.9680 0.9881
err_gbm <- 1 - cm_gbm$overall["Accuracy"]
best_name <- ifelse(cm_rf$overall["Accuracy"] >= cm_gbm$overall["Accuracy"], "RF", "GBM")
err_best <- ifelse(best_name == "RF", err_rf, err_gbm)
knitr::kable(data.frame(Modelo = c("RF","GBM"),
Accuracy = c(cm_rf$overall["Accuracy"], cm_gbm$overall["Accuracy"]),
ErrorFueraDeMuestra_Estimado = c(err_rf, err_gbm)))
Modelo | Accuracy | ErrorFueraDeMuestra_Estimado |
---|---|---|
RF | 0.9954121 | 0.0045879 |
GBM | 0.9643161 | 0.0356839 |
if (best_name == "RF") {
vip::vip(rf_fit$finalModel, num_features = 20)
} else {
vip::vip(gbm_fit$finalModel, num_features = 20)
}
user_name
, timestamps, ventanas) por no aportar
señal generalizable.Uso el mejor modelo para predecir los 20 casos de
pml-testing.csv
y escribir los archivos
problem_id_#.txt
requeridos por el cuestionario.
# Elegir el objeto de modelo ganador
best_fit <- if (best_name == "RF") rf_fit else gbm_fit
# Predicciones para los 20 casos
pred_test <- predict(best_fit, test)
pred_test
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
# Función para escribir archivos problem_id_#.txt
pml_write_files <- function(x) {
for (i in seq_along(x)) {
fn <- paste0("problem_id_", i, ".txt")
write.table(x[i], file = fn, quote = FALSE, row.names = FALSE, col.names = FALSE)
}
}
# Descomente para generar archivos en su carpeta de trabajo
# pml_write_files(pred_test)
Como estimación del error fuera de muestra, reporto el error del modelo ganador en el validation set (30% separado):
err_best
## Accuracy
## 0.004587935
Adicionalmente, la Accuracy de validación cruzada (5-fold) del modelo seleccionado sugiere un rendimiento similar, lo que reduce el riesgo de sobreajuste.
sessionInfo()
## R version 4.5.1 (2025-06-13)
## Platform: x86_64-apple-darwin20
## Running under: macOS Sequoia 15.3.2
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.5-x86_64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.5-x86_64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.1
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/Santiago
## tzcode source: internal
##
## attached base packages:
## [1] parallel stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] vip_0.4.1 janitor_2.2.1 doParallel_1.0.17
## [4] iterators_1.0.14 foreach_1.5.2 gbm_2.2.2
## [7] randomForest_4.7-1.2 caret_7.0-1 lattice_0.22-7
## [10] lubridate_1.9.4 forcats_1.0.0 stringr_1.5.1
## [13] dplyr_1.1.4 purrr_1.1.0 readr_2.1.5
## [16] tidyr_1.3.1 tibble_3.3.0 ggplot2_3.5.2
## [19] tidyverse_2.0.0
##
## loaded via a namespace (and not attached):
## [1] gtable_0.3.6 xfun_0.53 bslib_0.9.0
## [4] recipes_1.3.1 tzdb_0.5.0 vctrs_0.6.5
## [7] tools_4.5.1 generics_0.1.4 stats4_4.5.1
## [10] proxy_0.4-27 pkgconfig_2.0.3 ModelMetrics_1.2.2.2
## [13] Matrix_1.7-3 data.table_1.17.8 RColorBrewer_1.1-3
## [16] lifecycle_1.0.4 compiler_4.5.1 farver_2.1.2
## [19] codetools_0.2-20 snakecase_0.11.1 htmltools_0.5.8.1
## [22] class_7.3-23 sass_0.4.10 yaml_2.3.10
## [25] prodlim_2025.04.28 pillar_1.11.0 jquerylib_0.1.4
## [28] MASS_7.3-65 cachem_1.1.0 gower_1.0.2
## [31] rpart_4.1.24 nlme_3.1-168 parallelly_1.45.1
## [34] lava_1.8.1 tidyselect_1.2.1 digest_0.6.37
## [37] stringi_1.8.7 future_1.67.0 reshape2_1.4.4
## [40] listenv_0.9.1 labeling_0.4.3 splines_4.5.1
## [43] fastmap_1.2.0 grid_4.5.1 cli_3.6.5
## [46] magrittr_2.0.3 survival_3.8-3 e1071_1.7-16
## [49] future.apply_1.20.0 withr_3.0.2 scales_1.4.0
## [52] timechange_0.3.0 rmarkdown_2.29 globals_0.18.0
## [55] nnet_7.3-20 timeDate_4041.110 hms_1.1.3
## [58] evaluate_1.0.4 knitr_1.50 hardhat_1.4.2
## [61] rlang_1.1.6 Rcpp_1.1.0 glue_1.8.0
## [64] pROC_1.19.0.1 ipred_0.9-15 rstudioapi_0.17.1
## [67] jsonlite_2.0.0 R6_2.6.1 plyr_1.8.9