A strange problem I ran into recently.

On a VM image downloaded from osboxorg, I tried to fetch data with getURL. When no encoding is specified, fetching a Chinese web page returns garbled text.

I already know that it defaults to the ISO-8859-1 encoding, but why doesn't it use UTF-8?
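To make the symptom concrete, here is a minimal offline sketch in base R of what happens when UTF-8 bytes are decoded as ISO-8859-1. The sample string and variable names are only illustrative. One likely reason for the default, stated as an assumption rather than something confirmed here, is that older HTTP specifications (RFC 2616) define ISO-8859-1 as the charset for text content when the server declares none.

```r
# "中文" written with \u escapes so this snippet itself stays ASCII.
original <- "\u4e2d\u6587"

# The UTF-8 bytes a server would send for that text: e4 b8 ad e6 96 87.
bytes <- charToRaw(enc2utf8(original))

# Wrong guess: decode those bytes as if they were ISO-8859-1 (Latin-1).
# Every byte becomes its own character, which is the garbling seen on screen.
garbled <- iconv(rawToChar(bytes), from = "latin1", to = "UTF-8")

# Right guess: keep the same bytes and declare them to be UTF-8.
repaired <- rawToChar(bytes)
Encoding(repaired) <- "UTF-8"

nchar(original)                # 2 characters
nchar(garbled)                 # 6 characters, one per byte
identical(repaired, original)  # TRUE
```

The bytes are never damaged in this sketch, only mislabeled, which is why relabeling is enough to recover the text.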

The documentation says:

.encoding: an integer or a string that explicitly identifies the encoding of the content that is returned by the HTTP server in its response to our query. The possible strings are ‘UTF-8’ or ‘ISO-8859-1’ and the integers should be specified symbolically as CE_UTF8 and CE_LATIN1. Note that, by default, the package attempts to process the header of the HTTP response to determine the encoding. This argument is used when such information is erroneous and the caller knows the correct encoding. The default value leaves the decision to this default mechanism. This does however currently involve processing each line/chunk of the header (with a call to an R function). As a result, if one knows the encoding for the resulting response, specifying this avoids this slight overhead which is probably quite small relative to network latency and speed.
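The header-based detection described in that passage can be pictured with a small sketch. The helper `charset_from_content_type` below is hypothetical and is not RCurl's actual implementation. Its ISO-8859-1 fallback is an assumption chosen to mirror the behaviour observed above. The second function, `get_url_utf8`, is also a made-up name and simply passes the documented `.encoding` argument to `getURL`.

```r
# Hypothetical helper (not part of RCurl): extract the charset from a
# Content-Type header value, falling back to a default when none is declared.
charset_from_content_type <- function(content_type, default = "ISO-8859-1") {
  pattern <- "charset[[:space:]]*=[[:space:]]*\"?([^\";[:space:]]+)"
  m <- regmatches(content_type,
                  regexec(pattern, content_type, ignore.case = TRUE))[[1]]
  # m[1] is the whole match, m[2] the captured charset name (if any).
  if (length(m) < 2) default else toupper(m[2])
}

charset_from_content_type("text/html; charset=utf-8")  # "UTF-8"
charset_from_content_type("text/html")                 # "ISO-8859-1"

# Explicit override as described in the documentation. This needs the RCurl
# package and network access, so it is only defined here, not called.
get_url_utf8 <- function(url) RCurl::getURL(url, .encoding = "UTF-8")
```

If the server's Content-Type header carries no charset, a detector of this kind has nothing to go on, which would be consistent with the garbled output when `.encoding` is left unset.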

But so far I have not found a solution.

Downloading the same page by other means, however:

library(httr)
GET("https://www.ptt.cc/bbs/Lifeismoney/M.1567741886.A.BA9.html")

readLines(base::url("https://www.ptt.cc/bbs/Lifeismoney/M.1567741886.A.BA9.html", method = "libcurl"))

appears to work fine.
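For the httr route, a minimal sketch of getting the page body as decoded text could look like the following. The wrapper name `fetch_text_utf8` is hypothetical, and forcing UTF-8 is an assumption about the target page rather than something httr decides here. The function needs the httr package and network access, so it is defined but not called.

```r
# Hypothetical wrapper: fetch a page with httr and decode the body as UTF-8.
fetch_text_utf8 <- function(url) {
  resp <- httr::GET(url)
  # Turn HTTP errors (4xx/5xx) into R errors instead of parsing an error page.
  httr::stop_for_status(resp)
  # as = "text" returns a single character string; encoding sets how the
  # raw bytes of the response are decoded.
  httr::content(resp, as = "text", encoding = "UTF-8")
}
```

Calling `fetch_text_utf8()` with the URL used above should return the page as one string, which can then be passed to a parser such as `xml2::read_html()`.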

> sessionInfo()
R version 3.5.2 (2018-12-20)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 19.04

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.8.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.8.0

locale:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8        LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8   
 [6] LC_MESSAGES=C.UTF-8    LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C           LC_TELEPHONE=C        
[11] LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] tm_0.7-6        NLP_0.2-0       pttR_0.0.0.9000 rvest_0.3.4     xml2_1.2.2      readxl_1.3.1    reshape2_1.4.3 
 [8] dplyr_0.8.3     jiebaR_0.10.99  jiebaRD_0.1    

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.2          lubridate_1.7.4     lattice_0.20-38     tidyr_0.8.3         assertthat_0.2.1    digest_0.6.20      
 [7] slam_0.1-45         R6_2.4.0            cellranger_1.1.0    plyr_1.8.4          backports_1.1.4     acepack_1.4.1      
[13] quanteda_1.5.1      httr_1.4.1          ggplot2_3.2.1       pillar_1.4.2        rlang_0.4.0         lazyeval_0.2.2     
[19] rstudioapi_0.10     data.table_1.12.2   R.utils_2.9.0       R.oo_1.22.0         rpart_4.1-13        Matrix_1.2-15      
[25] checkmate_1.9.4     splines_3.5.2       selectr_0.4-1       stringr_1.4.0       foreign_0.8-71      htmlwidgets_1.3    
[31] RCurl_1.95-4.12     munsell_0.5.0       compiler_3.5.2      spacyr_1.2          xfun_0.9            pkgconfig_2.0.2    
[37] base64enc_0.1-3     htmltools_0.3.6     nnet_7.3-12         tidyselect_0.2.5    tibble_2.1.3        gridExtra_2.3      
[43] htmlTable_1.13.1    Hmisc_4.2-0         crayon_1.3.4        bitops_1.0-6        R.methodsS3_1.7.1   grid_3.5.2         
[49] gtable_0.3.0        magrittr_1.5        scales_1.0.0        RcppParallel_4.4.3  stringi_1.4.3       remotes_2.1.0      
[55] latticeExtra_0.6-28 stopwords_1.0       fastmatch_1.1-0     Formula_1.2-3       RColorBrewer_1.1-2  tools_3.5.2        
[61] glue_1.3.1          purrr_0.3.2         parallel_3.5.2      survival_2.43-3     colorspace_1.4-1    cluster_2.0.7-1    
[67] knitr_1.24