I wanted to know whether there is a limit on the number of rows that can be read with data.table's fread function. I am working with a table of 4 billion rows, 4 columns, about 40 GB in size. It appears that fread reads only the first ~840 million rows. It gives no errors but returns to the R prompt as if it had read all the data!
I understand that fread is not for "prod use" at the moment, and I wanted to find out if there is a timeframe for a prod-release implementation.
The reason for using data.table is that, for files of this size, it is extremely efficient at processing the data compared with loading the file into a data.frame, etc.
At the moment, I am trying two other alternatives:
1) Using scan and passing the result on to data.table:
data.table(matrix(scan("file.csv", what="integer", sep=","), ncol=4))

This resulted in:

Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, : too many items
2) Breaking the file into multiple individual segments of approximately 500 million rows each using Unix split, then reading them sequentially by looping over the files with fread. This is a bit cumbersome, but it appears to be a workable solution (a sketch follows this list).
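A minimal sketch of alternative 2, assuming the 40 GB file has already been split with something like split -l 500000000 file.csv chunk_ and that the pieces have no header. The file names, the per-segment reduction, and the column references are illustrative assumptions, not taken from this post; each segment is reduced before combining because a single data.table cannot hold anywhere near 4 billion rows.

library(data.table)

chunk_files <- list.files(pattern = "^chunk_")          # pieces produced by split (assumed names)
results <- vector("list", length(chunk_files))

for (i in seq_along(chunk_files)) {
  message(sprintf("Reading %s", chunk_files[i]))
  seg <- fread(chunk_files[i], header = FALSE, sep = ",")
  results[[i]] <- seg[, .(total = sum(V4)), by = V1]    # assumed per-segment reduction
  rm(seg); gc()                                         # release the segment before the next read
}

reduced <- rbindlist(results)[, .(total = sum(total)), by = V1]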
I think there may be an Rcpp-based way to do this faster, but I am not sure how it would be implemented.
Thanks in advance.
I was able to accomplish this using the feedback from another posting on StackOverflow. The process was very fast: 40 GB of data was read in about 10 minutes using fread iteratively. foreach with %dopar% failed to work when used on its own to read the files into new data.tables sequentially, due to the limitations mentioned on the page below.
Note: the file list (file_map) was prepared by running:
file_map <- list.files(pattern="test.$") # replace the pattern to suit your requirement
mclapply with big objects - "serialization is too large to store in a raw vector"
Quoting:
library(parallel)    # for mclapply
library(data.table)  # for fread and rbindlist

collector = vector("list", length(file_map)) # more complex than normal for speed

for (index in 1:length(file_map)) {
  reduced_set <- mclapply(file_map[[index]], function(x) {
    on.exit(message(sprintf("Completed: %s", x)))
    message(sprintf("Started: '%s'", x))
    fread(x)                                  # <----- changed this line to fread
  }, mc.cores=10)
  collector[[index]] = reduced_set
}

# Additional lines (in place of the rbind in the URL above)
finallist <- data.table()
for (i in 1:length(collector)) {
  finallist <- rbindlist(list(finallist, yourfunction(collector[[i]][[1]])))
}
# Replace yourfunction as needed; in my case it was an operation I performed on each
# segment, after which the segments were joined with rbindlist at the end.
My function included a loop using foreach with %dopar%, executed across several cores per file as specified in file_map. This allowed me to use %dopar% without encountering the "serialization is too large" error that occurs when running it on the combined file.
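The body of that function is not shown here; below is a minimal sketch of the per-file foreach/%dopar% pattern it describes, assuming forked workers registered via doParallel. The grouping column and the summary operation are illustrative assumptions, not the actual operation performed; the key point is that each worker returns only a small reduced result, which keeps it under the serialization limit.

library(data.table)
library(foreach)
library(doParallel)

registerDoParallel(cores = 10)   # forked workers on Unix, so the segment is shared rather than copied up front

# Illustrative stand-in for yourfunction(): takes one already-loaded segment
# (a data.table with default columns V1..V4) and spreads per-group work across cores.
yourfunction <- function(dt) {
  res <- foreach(g = unique(dt$V1), .packages = "data.table") %dopar% {
    dt[V1 == g, .(total = sum(V4)), by = V2]   # assumed per-group operation
  }
  rbindlist(res)
}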
Another helpful post is: "Loading files in parallel not working with foreach + data.table".