Row limit for data.table in R using fread


I wanted to know if there is a limit to the number of rows that can be read using the data.table fread function. I am working with a table of 4 billion rows, 4 columns, about 40 GB in size. It appears that fread reads only the first ~840 million rows. It does not give any errors, but returns to the R prompt as if it had read all the data!

I understand that fread is not for "prod use" at the moment, and wanted to find out if there is a timeframe for a production release.

The reason for using data.table is that, for files of such sizes, it is extremely efficient at processing the data compared to loading the file into a data.frame, etc.

At the moment, I am trying two other alternatives:

1) Using scan and passing it on to data.table:

data.table(matrix(scan("file.csv", what="integer", sep=","), ncol=4))

This resulted in:

Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, : too many items
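As an aside, scan's what argument expects a prototype value rather than a type name, so passing the string "integer" makes scan read the values as character. A minimal sketch of the same approach with a proper prototype (assuming the file has no header; this still builds the full vector in memory, so it does not solve the size problem by itself):

library(data.table)

vals <- scan("file.csv", what = integer(), sep = ",")     # integer() prototype, not the string "integer"
dt   <- data.table(matrix(vals, ncol = 4, byrow = TRUE))  # byrow = TRUE keeps each CSV row intact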

2) Breaking the file into multiple individual segments with a limit of approx. 500 million rows each using Unix split, then reading them sequentially by looping over the files with fread. This is a bit cumbersome, but it appears to be a workable solution (see the sketch below).
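A minimal sketch of alternative 2, assuming the segments come from a Unix "split -l" step and therefore carry no header line (the chunk_ file names and header = FALSE are assumptions):

library(data.table)

# Shell step, run once outside R:
#   split -l 500000000 file.csv chunk_

chunk_files <- list.files(pattern = "^chunk_")
pieces <- vector("list", length(chunk_files))

for (i in seq_along(chunk_files)) {
  pieces[[i]] <- fread(chunk_files[[i]], header = FALSE)   # read one segment at a time
  message(sprintf("Read %s: %d rows", chunk_files[[i]], nrow(pieces[[i]])))
}

combined <- rbindlist(pieces)   # stack the segments, assuming the result fits in RAM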

I think there may be an Rcpp way to do this faster, but I am not sure how it would be implemented.

Thanks in advance.

I was able to accomplish this using the feedback from another posting on Stack Overflow. The process was very fast and 40 GB of data was read in about 10 minutes using fread iteratively. foreach with dopar failed to work when run by itself to read the files into new data.tables sequentially, due to the limitations mentioned on the page below.

Note: The file list (file_map) was prepared by running:

file_map <- list.files(pattern="test.$")  # replace the pattern to suit your requirement

mclapply with big objects - "serialization is too large to store in a raw vector"

Quoting:

library(data.table)
library(parallel)

collector <- vector("list", length(file_map))  # more complex than normal, for speed

for (index in 1:length(file_map)) {
  reduced_set <- mclapply(file_map[[index]], function(x) {
    on.exit(message(sprintf("Completed: %s", x)))
    message(sprintf("Started: '%s'", x))
    fread(x)             # <----- changed this line to fread
  }, mc.cores = 10)
  collector[[index]] <- reduced_set
}

# Additional lines (in place of the rbind in the URL above)
finallist <- data.table()
for (i in 1:length(collector)) {
  finallist <- rbindlist(list(finallist, yourfunction(collector[[i]][[1]])))
}
# Replace yourfunction as needed; in my case it was an operation I performed on each segment before joining them with rbindlist at the end.

My function included a loop using foreach with dopar that was executed across several cores per file, as specified in file_map. This allowed me to use dopar without encountering the "serialization too large" error when running on the combined file.
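The per-file processing step is not shown in the quote above; a hypothetical sketch of what such a foreach/%dopar% pass over one file's data.table could look like (processFile, the chunk count, the core count and the placeholder aggregation are all illustrative assumptions, not the original code):

library(data.table)
library(foreach)
library(doParallel)

registerDoParallel(cores = 10)

processFile <- function(dt, n_chunks = 10) {
  idx <- cut(seq_len(nrow(dt)), breaks = n_chunks, labels = FALSE)  # assign each row to a chunk
  foreach(k = 1:n_chunks,
          .combine  = function(a, b) rbindlist(list(a, b)),
          .packages = "data.table") %dopar% {
    chunk <- dt[idx == k]
    chunk[, .(sum_v1 = sum(V1)), by = V2]   # placeholder operation; replace with the real work
  }
}

Because each worker only ever serializes one chunk of one file rather than the combined 40 GB table, the objects stay small enough to avoid the "serialization is too large" error.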

Another helpful post is at: Loading files in parallel not working with foreach + data.table

