Brief idea: a web crawler has two main jobs, a collector and a crawler. The collector collects the URL items of each site and stores only non-duplicated URLs. The crawler grabs URLs from storage, extracts the needed data, and stores it back.
2 machines:
bot machine -> 8 cores, physical Linux OS (no VM on the machine)
storage machine -> MySQL clustering (VM clustering), 2 databases (url and data); the url database is on port 1 and the data database on port 2
Objective: crawl 100 sites while trying to reduce bottleneck situations.
First case: the collector *requests (urllib) the sites and collects the URL items of each site, then *inserts a URL into the storage machine on port 1 if it is not a duplicate. The crawler *gets a URL from storage on port 1, *requests the site, extracts the needed data, and *stores it on port 2.
This causes a connection bottleneck, both on the web-site requests and on the MySQL connections.
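A minimal sketch of this first case, assuming pymysql plus stdlib urllib; the host name, user, table names, and the extract_urls parser are placeholders I made up, not part of the setup above:

import re
import urllib.request
import pymysql

# Placeholder connection details; host, user, and table names are assumptions.
url_db = pymysql.connect(host="storage-machine", port=1, database="url", user="bot")
data_db = pymysql.connect(host="storage-machine", port=2, database="data", user="bot")

def extract_urls(html):
    # Naive link extraction, just for illustration.
    return re.findall(r'href="(https?://[^"]+)"', html)

def collect(site):
    # One web request per site, then one DB round trip per URL:
    # this is exactly where the connection bottleneck builds up.
    html = urllib.request.urlopen(site).read().decode(errors="ignore")
    with url_db.cursor() as cur:
        for url in extract_urls(html):
            # INSERT IGNORE makes MySQL drop duplicates (urls.url must be UNIQUE).
            cur.execute("INSERT IGNORE INTO urls (url) VALUES (%s)", (url,))
    url_db.commit()

def crawl():
    # Take one URL from storage on port 1, fetch it, store the data on port 2.
    with url_db.cursor() as cur:
        cur.execute("SELECT url FROM urls LIMIT 1")
        row = cur.fetchone()
    if row:
        html = urllib.request.urlopen(row[0]).read().decode(errors="ignore")
        with data_db.cursor() as cur:
            cur.execute("INSERT INTO pages (url, body) VALUES (%s, %s)", (row[0], html))
        data_db.commit()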
Second case: instead of inserting across machines, the collector stores URLs in its own mini database on the local file system. There is no *reading of the huge file (by using OS command techniques), only *writing (appending) and *removing the header (first line) of the file.
This causes a web-site request bottleneck and, maybe, an I/O (read/write) bottleneck.
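A sketch of this second case, treating a flat file as the mini URL queue and shelling out to OS commands so the whole file is never loaded into Python; the file path is an assumption, and head/sed stand in for whatever "OS command technique" is actually used:

import subprocess

QUEUE = "/var/crawler/urls.txt"  # placeholder path

def push(url):
    # *write (append): constant-time, never reads the existing file.
    with open(QUEUE, "a") as f:
        f.write(url + "\n")

def pop():
    # *read only the header line with head(1) ...
    head = subprocess.run(["head", "-n", "1", QUEUE],
                          capture_output=True, text=True).stdout.strip()
    # ... then *remove it with sed(1). Note that sed -i still rewrites the
    # file on disk, which is part of the possible I/O bottleneck above.
    subprocess.run(["sed", "-i", "1d", QUEUE])
    return head or None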
Both cases are also CPU bound because of collecting and crawling 100 sites.
As I have heard it: for I/O-bound work use multithreading, for CPU-bound work use multiprocessing.
How can I use both? Scrapy? Any ideas or suggestions?
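One stdlib-only way to combine both, sketched with made-up worker counts and site list: one process per core for the CPU-bound extraction, and a thread pool inside each process for the I/O-bound requests.

from concurrent.futures import ThreadPoolExecutor
from multiprocessing import Pool
import urllib.request

SITES = ["http://example.com/%d" % i for i in range(100)]  # placeholder list

def fetch(url):
    # I/O bound: threads overlap the waits on the network.
    try:
        return urllib.request.urlopen(url, timeout=10).read()
    except OSError:
        return b""  # skip unreachable sites in this sketch

def handle_chunk(urls):
    # Runs in a separate process, so the CPU-bound work below is not
    # serialized against the other workers by a single GIL.
    with ThreadPoolExecutor(max_workers=20) as threads:
        pages = list(threads.map(fetch, urls))
    return [len(p) for p in pages]  # stand-in for real data extraction

if __name__ == "__main__":
    chunks = [SITES[i::8] for i in range(8)]  # one chunk per core
    with Pool(processes=8) as procs:
        results = procs.map(handle_chunk, chunks)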
Look at grequests: it doesn't do actual multi-threading or multiprocessing, but it scales better than both.
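For illustration, a minimal grequests sketch (the URL list is made up); grequests rides on gevent coroutines, so many requests can be in flight from a single process without threads or extra processes:

import grequests  # import before requests so gevent can monkey-patch cleanly

urls = ["http://example.com/%d" % i for i in range(100)]  # placeholder list
pending = (grequests.get(u) for u in urls)

# size caps how many requests are in flight at once.
for response in grequests.map(pending, size=20):
    if response is not None:  # failed requests come back as None
        print(response.url, len(response.content))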