Summary

In this report I build a machine learning model to predict the classe variable (the manner in which the exercise was performed) using accelerometer data from the Practical Machine Learning project. I follow a reproducible workflow: data download, cleaning and preprocessing, train/validation split, cross-validation, model comparison, out-of-sample error estimation, and predictions for the 20 test cases.

Setup

Packages

pkgs <- c("tidyverse","caret","randomForest","gbm","doParallel","janitor","vip")
need <- pkgs[!pkgs %in% installed.packages()[, "Package"]]
if(length(need)) install.packages(need, repos = "https://cloud.r-project.org")
invisible(lapply(pkgs, require, character.only = TRUE))

Data

I download the official training and testing sets and read them, treating "NA", "#DIV/0!", and empty strings as NA.

train_url <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
test_url  <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
train_raw <- read.csv(train_url, na.strings = c("NA", "#DIV/0!", ""))
test_raw  <- read.csv(test_url,  na.strings = c("NA", "#DIV/0!", ""))
dim(train_raw); dim(test_raw)
## [1] 19622   160
## [1]  20 160
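
When knitting repeatedly, the two files can be cached locally instead of being re-downloaded on every run. A minimal sketch, assuming local file names of my own choosing (not part of the assignment):

# Hypothetical cache: download each file once, then read it from disk
read_pml <- function(url, dest) {
  if (!file.exists(dest)) download.file(url, dest, mode = "wb")
  read.csv(dest, na.strings = c("NA", "#DIV/0!", ""))
}
# train_raw <- read_pml(train_url, "pml-training.csv")
# test_raw  <- read_pml(test_url,  "pml-testing.csv")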

Preprocessing

The goal is to keep informative numeric predictors that exist in both train and test.

  1. Drop columns unrelated to the sensors (identifiers/timestamps/windows).

  2. Drop columns whose proportion of NAs is ≥ 95%.

  3. Keep only variables present in both sets, then remove near-zero-variance predictors (computed on train).

  4. Center and scale the numeric predictors using parameters learned on the training set.

# 1) Drop obvious non-predictor columns
DROP <- c("X","user_name","raw_timestamp_part_1","raw_timestamp_part_2",
          "cvtd_timestamp","new_window","num_window")
train <- train_raw %>% dplyr::select(-any_of(DROP))
test  <- test_raw  %>% dplyr::select(-any_of(DROP))

# 2) Columns with too many NAs (>= 95%)
na_rate <- sapply(train, function(x) mean(is.na(x)))
keep_na <- names(na_rate)[na_rate < 0.95]
train <- train[, keep_na]
# Make sure 'classe' is still present
stopifnot("classe" %in% names(train))

# 3) Keep the intersection of column names between train and test
common <- intersect(names(train), names(test))
# Note: 'classe' does not exist in test, so we keep it explicitly on the training side
train <- train %>% dplyr::select(any_of(c(common, "classe")))
test  <- test  %>% dplyr::select(any_of(common))

# Convert the outcome to a factor
train$classe <- factor(train$classe)

# Near-zero variance (computed on train)
nzv <- nearZeroVar(train %>% dplyr::select(-classe), saveMetrics = TRUE)
train <- train[, c(rownames(nzv)[!nzv$nzv], "classe")]
test  <- test[, rownames(nzv)[!nzv$nzv]]

# 4) Center/scale the numeric columns
num_cols <- names(train)[sapply(train, is.numeric)]
pp <- preProcess(train[, num_cols], method = c("center","scale"))
train[, num_cols] <- predict(pp, train[, num_cols])
test[,  num_cols] <- predict(pp, test[,  num_cols])

# Quick overview
list(
  n_predictors = ncol(train) - 1,
  n_rows_train = nrow(train),
  n_rows_test  = nrow(test)
)
## $n_predictors
## [1] 52
## 
## $n_rows_train
## [1] 19622
## 
## $n_rows_test
## [1] 20

Data partition

I use a stratified 70/30 split to estimate the out-of-sample error.

set.seed(20250904)
split <- createDataPartition(train$classe, p = 0.7, list = FALSE)
train_set <- train[split, ]
valid_set <- train[-split, ]
dim(train_set); dim(valid_set)
## [1] 13737    53
## [1] 5885   53

Modeling and cross-validation

I train and compare two models that are strong on this kind of data: Random Forest (RF) and Gradient Boosting (GBM), both with 5-fold cross-validation.

ctrl <- trainControl(method = "cv", number = 5, allowParallel = TRUE)
cl <- parallel::makePSOCKcluster(max(1, parallel::detectCores() - 1))
doParallel::registerDoParallel(cl)

set.seed(20250904)
rf_fit <- train(classe ~ ., data = train_set,
                method = "rf", trControl = ctrl,
                ntree = 500, importance = TRUE)

set.seed(20250904)
gbm_fit <- train(classe ~ ., data = train_set,
                 method = "gbm", trControl = ctrl,
                 verbose = FALSE)

stopCluster(cl); registerDoSEQ()

rf_fit; gbm_fit
## Random Forest 
## 
## 13737 samples
##    52 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 10988, 10990, 10988, 10991, 10991 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.9909002  0.9884882
##   27    0.9886435  0.9856334
##   52    0.9802706  0.9750393
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
## Stochastic Gradient Boosting 
## 
## 13737 samples
##    52 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 10988, 10990, 10988, 10991, 10991 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.trees  Accuracy   Kappa    
##   1                   50      0.7503081  0.6834491
##   1                  100      0.8215775  0.7741082
##   1                  150      0.8552087  0.8167779
##   2                   50      0.8557174  0.8172028
##   2                  100      0.9079850  0.8835427
##   2                  150      0.9310602  0.9127612
##   3                   50      0.8959017  0.8681665
##   3                  100      0.9400890  0.9241912
##   3                  150      0.9584322  0.9474103
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 150, interaction.depth =
##  3, shrinkage = 0.1 and n.minobsinnode = 10.

Comparison via resamples

res <- resamples(list(RF = rf_fit, GBM = gbm_fit))
summary(res)
## 
## Call:
## summary.resamples(object = res)
## 
## Models: RF, GBM 
## Number of resamples: 5 
## 
## Accuracy 
##          Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## RF  0.9894392 0.9901675 0.9909058 0.9909002 0.9919913 0.9919971    0
## GBM 0.9522942 0.9566642 0.9596217 0.9584322 0.9614405 0.9621405    0
## 
## Kappa 
##          Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## RF  0.9866396 0.9875592 0.9884963 0.9884882 0.9898695 0.9898763    0
## GBM 0.9396448 0.9451826 0.9489032 0.9474103 0.9511961 0.9521250    0
lattice::bwplot(res)
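
Since both models were trained from the same seed, the five folds should coincide, which allows a paired comparison of the resamples; caret's diff() provides this. A minimal sketch:

# Paired fold-by-fold differences, with t-test p-values for each metric
summary(diff(res))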

Evaluation on the validation set

rf_pred_v <- predict(rf_fit, valid_set)
cm_rf <- confusionMatrix(rf_pred_v, valid_set$classe)
cm_rf
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1674    5    0    0    0
##          B    0 1132    6    0    0
##          C    0    2 1020   13    0
##          D    0    0    0  951    1
##          E    0    0    0    0 1081
## 
## Overall Statistics
##                                          
##                Accuracy : 0.9954         
##                  95% CI : (0.9933, 0.997)
##     No Information Rate : 0.2845         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.9942         
##                                          
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            1.0000   0.9939   0.9942   0.9865   0.9991
## Specificity            0.9988   0.9987   0.9969   0.9998   1.0000
## Pos Pred Value         0.9970   0.9947   0.9855   0.9989   1.0000
## Neg Pred Value         1.0000   0.9985   0.9988   0.9974   0.9998
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2845   0.1924   0.1733   0.1616   0.1837
## Detection Prevalence   0.2853   0.1934   0.1759   0.1618   0.1837
## Balanced Accuracy      0.9994   0.9963   0.9955   0.9932   0.9995
err_rf <- 1 - cm_rf$overall["Accuracy"]

gbm_pred_v <- predict(gbm_fit, valid_set)
cm_gbm <- confusionMatrix(gbm_pred_v, valid_set$classe)
cm_gbm
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1649   35    0    0    4
##          B   18 1079   29    5    5
##          C    4   25  980   40    4
##          D    3    0   16  908   10
##          E    0    0    1   11 1059
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9643          
##                  95% CI : (0.9593, 0.9689)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9548          
##                                           
##  Mcnemar's Test P-Value : 2.662e-05       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9851   0.9473   0.9552   0.9419   0.9787
## Specificity            0.9907   0.9880   0.9850   0.9941   0.9975
## Pos Pred Value         0.9769   0.9498   0.9307   0.9691   0.9888
## Neg Pred Value         0.9940   0.9874   0.9905   0.9887   0.9952
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2802   0.1833   0.1665   0.1543   0.1799
## Detection Prevalence   0.2868   0.1930   0.1789   0.1592   0.1820
## Balanced Accuracy      0.9879   0.9677   0.9701   0.9680   0.9881
err_gbm <- 1 - cm_gbm$overall["Accuracy"]

best_name <- if (cm_rf$overall["Accuracy"] >= cm_gbm$overall["Accuracy"]) "RF" else "GBM"
err_best  <- if (best_name == "RF") err_rf else err_gbm
knitr::kable(data.frame(Model = c("RF","GBM"),
                        Accuracy = c(cm_rf$overall["Accuracy"], cm_gbm$overall["Accuracy"]),
                        EstimatedOutOfSampleError = c(err_rf, err_gbm)))
Model  Accuracy   EstimatedOutOfSampleError
RF     0.9954121  0.0045879
GBM    0.9643161  0.0356839

Variable importance (winning model)

if (best_name == "RF") {
  vip::vip(rf_fit$finalModel, num_features = 20)
} else {
  vip::vip(gbm_fit$finalModel, num_features = 20)
}
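
As a complement to the plot, caret's model-agnostic varImp() wrapper returns the same ranking in tabular or lattice form; a quick sketch for the RF fit:

# Importance scores scaled to 0-100; lattice plot of the top 20 predictors
vi_rf <- caret::varImp(rf_fit)
plot(vi_rf, top = 20)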

Design decisions

  • Aggressive NA cleanup: predictors with ≥ 95% NAs were dropped to avoid overfitting and large-scale imputation.
  • Removal of non-sensor variables (user_name, timestamps, windows), since they carry no generalizable signal.
  • Standardization: centering/scaling the numeric variables helps the gradient-based model (GBM); RF is scale-invariant, but standardization does not hurt it (see the caveat and sketch after this list).
  • 70/30 split + 5-fold CV: gives a more honest estimate of the out-of-sample error; the primary metric is Accuracy.
  • Model selection: I keep the model with the highest validation Accuracy.
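
One caveat on the standardization bullet: in the code above, preProcess was fit on the full training file before the 70/30 split, so the validation rows had a small influence on the centering/scaling constants. A stricter sketch, assuming the pipeline were re-run with scaling deferred until after the split:

# Stricter variant: learn center/scale on train_set only, then apply the
# same transform to the validation and test sets (no leakage of validation rows)
pp2 <- preProcess(train_set[, num_cols], method = c("center", "scale"))
train_set_s <- predict(pp2, train_set)
valid_set_s <- predict(pp2, valid_set)
test_s      <- predict(pp2, test)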

Final model and predictions for the test set

I use the best model to predict the 20 cases in pml-testing.csv and write the problem_id_#.txt files required by the quiz.

# Pick the winning model object
best_fit <- if (best_name == "RF") rf_fit else gbm_fit

# Predictions for the 20 cases
pred_test <- predict(best_fit, test)
pred_test
##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
# Function to write the problem_id_#.txt files
pml_write_files <- function(x) {
  for (i in seq_along(x)) {
    fn <- paste0("problem_id_", i, ".txt")
    write.table(x[i], file = fn, quote = FALSE, row.names = FALSE, col.names = FALSE)
  }
}

# Uncomment to generate the files in your working directory
# pml_write_files(pred_test)

Expected out-of-sample error

As the out-of-sample error estimate, I report the winning model's error on the validation set (the held-out 30%):

err_best
##    Accuracy 
## 0.004587935

Additionally, the 5-fold cross-validation Accuracy of the selected model suggests similar performance, which reduces the risk of overfitting.
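
To attach an interval to this point estimate, an exact binomial test on the validation misclassifications can be used, and caret's getTrainPerf() recovers the CV accuracy for the side-by-side check. A minimal sketch:

# Exact 95% CI for the out-of-sample error (27 misses out of 5885 rows)
miss <- sum(rf_pred_v != valid_set$classe)
binom.test(miss, nrow(valid_set))$conf.int

# CV accuracy of the selected model, for comparison with the hold-out estimate
getTrainPerf(rf_fit)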

Reproducibility

  • This document is self-contained and sets a random seed.
  • Package versions and session information are included at the end.
  • To regenerate the HTML in RStudio, click Knit.

References

  • Original data: PUC-Rio HAR (link in the course assignment).
  • Practical Machine Learning course (Coursera).

Session Info

sessionInfo()
## R version 4.5.1 (2025-06-13)
## Platform: x86_64-apple-darwin20
## Running under: macOS Sequoia 15.3.2
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.5-x86_64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.5-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/Santiago
## tzcode source: internal
## 
## attached base packages:
## [1] parallel  stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] vip_0.4.1            janitor_2.2.1        doParallel_1.0.17   
##  [4] iterators_1.0.14     foreach_1.5.2        gbm_2.2.2           
##  [7] randomForest_4.7-1.2 caret_7.0-1          lattice_0.22-7      
## [10] lubridate_1.9.4      forcats_1.0.0        stringr_1.5.1       
## [13] dplyr_1.1.4          purrr_1.1.0          readr_2.1.5         
## [16] tidyr_1.3.1          tibble_3.3.0         ggplot2_3.5.2       
## [19] tidyverse_2.0.0     
## 
## loaded via a namespace (and not attached):
##  [1] gtable_0.3.6         xfun_0.53            bslib_0.9.0         
##  [4] recipes_1.3.1        tzdb_0.5.0           vctrs_0.6.5         
##  [7] tools_4.5.1          generics_0.1.4       stats4_4.5.1        
## [10] proxy_0.4-27         pkgconfig_2.0.3      ModelMetrics_1.2.2.2
## [13] Matrix_1.7-3         data.table_1.17.8    RColorBrewer_1.1-3  
## [16] lifecycle_1.0.4      compiler_4.5.1       farver_2.1.2        
## [19] codetools_0.2-20     snakecase_0.11.1     htmltools_0.5.8.1   
## [22] class_7.3-23         sass_0.4.10          yaml_2.3.10         
## [25] prodlim_2025.04.28   pillar_1.11.0        jquerylib_0.1.4     
## [28] MASS_7.3-65          cachem_1.1.0         gower_1.0.2         
## [31] rpart_4.1.24         nlme_3.1-168         parallelly_1.45.1   
## [34] lava_1.8.1           tidyselect_1.2.1     digest_0.6.37       
## [37] stringi_1.8.7        future_1.67.0        reshape2_1.4.4      
## [40] listenv_0.9.1        labeling_0.4.3       splines_4.5.1       
## [43] fastmap_1.2.0        grid_4.5.1           cli_3.6.5           
## [46] magrittr_2.0.3       survival_3.8-3       e1071_1.7-16        
## [49] future.apply_1.20.0  withr_3.0.2          scales_1.4.0        
## [52] timechange_0.3.0     rmarkdown_2.29       globals_0.18.0      
## [55] nnet_7.3-20          timeDate_4041.110    hms_1.1.3           
## [58] evaluate_1.0.4       knitr_1.50           hardhat_1.4.2       
## [61] rlang_1.1.6          Rcpp_1.1.0           glue_1.8.0          
## [64] pROC_1.19.0.1        ipred_0.9-15         rstudioapi_0.17.1   
## [67] jsonlite_2.0.0       R6_2.6.1             plyr_1.8.9