Converting Character to Numeric without NA Coercion in R


I'm working in R and have a data frame, dd_2006, with numeric vectors. When I first imported the data, I needed to remove $'s, decimal points, and some blank spaces from 3 of my variables: sumofcost, sumofcases, and sumofunits. To do that, I used str_replace_all. However, once I used str_replace_all, the vectors were converted to character. So I used as.numeric(var) to convert the vectors back to numeric, but NAs were introduced, even though there were no NAs in the vectors when I ran the code below before running the as.numeric code.

    > sum(is.na(dd_2006$sumofcost))
    [1] 0
    > sum(is.na(dd_2006$sumofcases))
    [1] 0
    > sum(is.na(dd_2006$sumofunits))
    [1] 0

Here is my code after the import, beginning with removing the $ from the vector. In the str(dd_2006) output, I deleted some of the variables for the sake of space, so the column numbers in the str_replace_all code below don't match the output I've posted here (but they do in my original code):

    library("stringr")
    dd_2006$sumofcost <- str_sub(dd_2006$sumofcost, 2, ) # 2 = the first # after $

    # removes decimal pt, zeros after, and commas
    dd_2006[ ,9] <- str_replace_all(dd_2006[ ,9], ".00", "")
    dd_2006[,9] <- str_replace_all(dd_2006[,9], ",", "")

    dd_2006[ ,10] <- str_replace_all(dd_2006[ ,10], ".00", "")
    dd_2006[ ,10] <- str_replace_all(dd_2006[,10], ",", "")

    dd_2006[ ,11] <- str_replace_all(dd_2006[ ,11], ".00", "")
    dd_2006[,11] <- str_replace_all(dd_2006[,11], ",", "")

    str(dd_2006)
    'data.frame':   12604 obs. of  14 variables:
     $ cmhsp                     : Factor w/ 46 levels "allegan","ausable valley",..: 1 1 1
     $ fy                        : Factor w/ 1 level "2006": 1 1 1 1 1 1 1 1 1 1 ...
     $ population                : Factor w/ 1 level "dd": 1 1 1 1 1 1 1 1 1 1 ...
     $ sumofcases                : chr  "0" "1" "0" "0" ...
     $ sumofunits                : chr  "0" "365" "0" "0" ...
     $ sumofcost                 : chr  "0" "96416" "0" "0" ...

I found a response to a question similar to mine here, using the following code:

    # create dummy data.frame
    d <- data.frame(char = letters[1:5],
                    fake_char = as.character(1:5),
                    fac = factor(1:5),
                    char_fac = factor(letters[1:5]),
                    num = 1:5, stringsAsFactors = FALSE)

Let us have a glance at the data.frame:

    > d
      char fake_char fac char_fac num
    1    a         1   1        a   1
    2    b         2   2        b   2
    3    c         3   3        c   3
    4    d         4   4        d   4
    5    e         5   5        e   5

And let us run:

    > sapply(d, mode)
           char   fake_char         fac    char_fac         num
    "character" "character"   "numeric"   "numeric"   "numeric"
    > sapply(d, class)
           char   fake_char         fac    char_fac         num
    "character" "character"    "factor"    "factor"   "integer"

Now you probably ask yourself, "Where's the anomaly?" Well, I've bumped into quite peculiar things in R, and this is not the most confounding one, but it can confuse you, especially if you read this before rolling into bed.

Here goes: the first two columns are character. I've deliberately called the 2nd one fake_char. Spot the similarity of this character variable with the one that Dirk created in his reply. It's actually a numerical vector converted to character. The 3rd and 4th columns are factors, and the last one is "purely" numeric.

If you use the transform function, you can convert fake_char into numeric, but not the char variable itself.

    > transform(d, char = as.numeric(char))
      char fake_char fac char_fac num
    1   NA         1   1        a   1
    2   NA         2   2        b   2
    3   NA         3   3        c   3
    4   NA         4   4        d   4
    5   NA         5   5        e   5
    Warning message:
    In eval(expr, envir, enclos) : NAs introduced by coercion

But if you do the same thing on fake_char and char_fac, you'll be lucky, and get away with no NAs:

    > transform(d, fake_char = as.numeric(fake_char), char_fac = as.numeric(char_fac))
      char fake_char fac char_fac num
    1    a         1   1        1   1
    2    b         2   2        2   2
    3    c         3   3        3   3
    4    d         4   4        4   4
    5    e         5   5        5   5

So I tried the above code in my script, but still came up with NAs (without a warning message about coercion).

    # changing sumofcases, cost, and units to numeric
    dd_2006_1 <- transform(dd_2006,
                           sumofcases = as.numeric(sumofcases),
                           sumofunits = as.numeric(sumofunits),
                           sumofcost = as.numeric(sumofcost))

    > sum(is.na(dd_2006_1$sumofcost))
    [1] 12
    > sum(is.na(dd_2006_1$sumofcases))
    [1] 7
    > sum(is.na(dd_2006_1$sumofunits))
    [1] 11

I've used table(dd_2006$sumofcases) etc. to look at the observations and see if there are any characters that I missed, but there weren't any. Any thoughts on why the NAs are popping up, and how to get rid of them?

As Anando pointed out, the problem is somewhere in your data, and we can't really help you much without a reproducible example. That said, here's a code snippet to help you pin down the records in your data that are causing you problems:

    test = as.character(c(1,2,3,4,'m'))
    v = as.numeric(test) # NAs introduced by coercion
    ix.na = is.na(v)
    which(ix.na) # row index of our problem = 5
    test[ix.na]  # shows the problematic record, "m"

Instead of guessing why NAs are being introduced, pull out the records that are causing the problem and address them directly/individually until the NAs go away.
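The same check can be wrapped in a small helper and run over several columns at once. This is only a minimal sketch in base R: the helper name `find_bad_values` and the toy data frame `df` are made up for illustration and stand in for the real dd_2006, which I haven't seen.

```r
# Hypothetical helper: for each named column, return the distinct raw values
# that as.numeric() cannot parse (values that were already NA are ignored).
find_bad_values <- function(data, cols) {
  lapply(data[cols], function(x) {
    x <- as.character(x)
    converted <- suppressWarnings(as.numeric(x))
    unique(x[is.na(converted) & !is.na(x)])
  })
}

# Toy data standing in for dd_2006 (invented for this example)
df <- data.frame(sumofcases = c("0", "1", "", "7"),
                 sumofcost  = c("0", "96416", "1 200", "m"),
                 stringsAsFactors = FALSE)

bad <- find_bad_values(df, c("sumofcases", "sumofcost"))
print(bad)
```

Each list element holds the offending strings for one column, so empty strings and stray characters show up directly instead of as anonymous NAs.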

Update: It looks like the problem is in your call to str_replace_all. I don't know the stringr library, but I think you can accomplish the same thing with gsub like this:

    v2 = c("1.00","2.00","3.00")
    gsub("\\.00", "", v2)

    [1] "1" "2" "3"

I'm not entirely sure what this accomplishes though:

    sum(as.numeric(v2)!=as.numeric(gsub("\\.00", "", v2))) # illustrates that the vectors are equivalent

    [1] 0

Unless this achieves some specific purpose for you, I'd suggest dropping this step from your preprocessing entirely, as it doesn't appear necessary and seems to be giving you problems.
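One plausible explanation for the NAs, sketched here in base R under assumptions: in a regular expression an unescaped `.` matches any character, so the pattern ".00" can also swallow digits, and a value such as "100.00" ends up as an empty string, which as.numeric turns into NA (as far as I know without a warning). The sample values in `raw` and the helper name `clean_money` are invented for illustration; I'm assuming the raw data looks roughly like "$1,200.00".

```r
# Invented sample values in the assumed "$1,200.00" style
raw <- c("$100.00", "$1,200.00", "$5003.00", " $0.00")

# Strip the leading "$" (and any leading whitespace) first
no_dollar <- sub("^[[:space:]]*\\$", "", raw)

# Unescaped ".00": the dot matches ANY character, so digits are removed too
broken <- gsub(",", "", gsub(".00", "", no_dollar))
print(broken)
print(suppressWarnings(as.numeric(broken)))

# Hypothetical helper: remove only "$", "," and whitespace, then convert.
# as.numeric() already handles the trailing ".00" on its own.
clean_money <- function(x) {
  as.numeric(gsub("[$,[:space:]]", "", x))
}

cleaned <- clean_money(raw)
print(cleaned)
```

Comparing `broken` with `cleaned` shows that the unescaped pattern can silently change values as well as create NAs, so it is worth checking the converted numbers, not just the NA counts.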

