Python web crawler multithreading and multiprocessing


Brief idea: the web crawler has two main jobs, a collector and a crawler. The collector collects the URL items from each site and stores only non-duplicated URLs. The crawler grabs URLs from storage, extracts the needed data, and stores it back.

Two machines:

  1. Bot machine -> 8 cores, physical Linux OS (no VM on the machine)

  2. Storage machine -> MySQL clustering (VM clustering), two databases (URL and data); the URL database is on port 1 and the data database on port 2

Objective: crawl 100 sites and try to reduce the bottlenecks.

First case: the collector requests the sites (with urllib), collects the URL items from each site, and inserts each URL into the storage machine on port 1 if it is not a duplicate. The crawler gets a URL from storage on port 1, requests the site, extracts the needed data, and stores it on port 2.
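A minimal sketch of what I mean in the first case (the pymysql driver, the table names, the `done` flag and the credentials are just stand-ins, not my real setup):

```python
import re
import urllib.request

import pymysql  # assumed MySQL driver; any client library would look similar

# Stand-in connection details: "port 1" / "port 2" mirror the two databases above
url_db  = pymysql.connect(host="storage", port=1, user="bot", password="pw", database="urls")
data_db = pymysql.connect(host="storage", port=2, user="bot", password="pw", database="data")

def fetch(url):
    return urllib.request.urlopen(url, timeout=10).read().decode("utf-8", errors="ignore")

def collector(site):
    """Request the site, collect its URLs, insert only non-duplicated ones."""
    links = re.findall(r'href="(https?://[^"]+)"', fetch(site))   # naive link extraction
    with url_db.cursor() as cur:
        for link in links:
            # INSERT IGNORE assumes a UNIQUE index on urls.url, so duplicates are dropped
            cur.execute("INSERT IGNORE INTO urls (url) VALUES (%s)", (link,))
    url_db.commit()

def crawler():
    """Take one URL from the url database, request it, store the extracted data."""
    with url_db.cursor() as cur:
        cur.execute("SELECT id, url FROM urls WHERE done = 0 LIMIT 1")
        row = cur.fetchone()
        if row is None:
            return
        cur.execute("UPDATE urls SET done = 1 WHERE id = %s", (row[0],))
    url_db.commit()
    html = fetch(row[1])
    title = re.search(r"<title>(.*?)</title>", html, re.S)  # "needed data": page title here
    with data_db.cursor() as cur:
        cur.execute("INSERT INTO data (url, title) VALUES (%s, %s)",
                    (row[1], title.group(1) if title else ""))
    data_db.commit()
```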

This first case causes a connection bottleneck on both the web site requests and the MySQL connections.

Second case: instead of inserting across machines, the collector stores the URLs in its own mini database on the file system. The huge file is never read in full (an OS command technique is used); writes are appends only, and the header line of the file is removed when a URL is taken.
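Roughly what I mean in the second case, assuming `head`/`sed` as the OS commands (the file name and the exact commands are just one possible way to do it):

```python
import subprocess

URL_FILE = "urls.txt"   # the collector's local "mini database" file (stand-in name)

def append_url(url, seen):
    """Collector side: append only URLs that have not been seen before."""
    if url in seen:
        return
    seen.add(url)
    with open(URL_FILE, "a") as f:   # append only, never rewrite the whole file
        f.write(url + "\n")

def pop_url():
    """Crawler side: read the first line and remove it with OS commands,
    so the whole file is never loaded into Python memory."""
    head = subprocess.run(["head", "-n", "1", URL_FILE],
                          capture_output=True, text=True).stdout.strip()
    if head:
        subprocess.run(["sed", "-i", "1d", URL_FILE])   # drop the "header" line in place
    return head or None
```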

This second case causes a bottleneck on the web site request connections and possibly on file I/O (reads and writes).

Both cases are also CPU-bound because of collecting and crawling 100 sites.

As I have heard it: for I/O-bound work use multithreading, for CPU-bound work use multiprocessing.
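What I have in mind for combining the two looks roughly like this: one process per core, each with its own thread pool for the network I/O (the chunk split, the pool sizes and the example URLs are just guesses on my part):

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
import urllib.request

SITES = ["https://example.com/", "https://example.org/"]   # stand-ins for the 100 sites

def fetch(url):
    # I/O-bound step: the GIL is released while waiting on the network,
    # so threads inside one process can overlap these requests
    return urllib.request.urlopen(url, timeout=10).read()

def parse(html):
    # CPU-bound step (stand-in): real extraction/parsing work would go here
    return len(html)

def handle_chunk(urls):
    """Runs in a separate process: threads for the I/O, this process's CPU for parsing."""
    with ThreadPoolExecutor(max_workers=20) as pool:
        pages = list(pool.map(fetch, urls))
    return [parse(p) for p in pages]

if __name__ == "__main__":
    chunks = [SITES[i::8] for i in range(8)]            # one chunk per core on the 8-core box
    with ProcessPoolExecutor(max_workers=8) as procs:   # multiprocessing for the CPU-bound part
        for result in procs.map(handle_chunk, chunks):
            print(result)
```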

How do I use both? Scrapy? Any ideas or suggestions?

Look at grequests. It doesn't do actual multi-threading or multiprocessing, but it scales better than both.
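A minimal example of what that looks like (the URLs and the `size` limit are just placeholders):

```python
import grequests   # pip install grequests; gevent-based, no threads or processes needed

urls = ["https://example.com/", "https://example.org/"]   # stand-ins for the site list

# Build the requests lazily, then fire them concurrently on gevent greenlets
reqs = (grequests.get(u, timeout=10) for u in urls)
for resp in grequests.map(reqs, size=20):   # size caps the number of concurrent connections
    if resp is not None:                    # failed requests come back as None
        print(resp.url, len(resp.content))
```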

