FastCrawl is a PHP web scraping framework that uses cURL, MySQL, and XPath for fast, accurate web scraping and crawling.
When it comes to web crawls, about 90% of processing time is spent waiting for network I/O. That is, the web request is significantly slower than any other part of the system and becomes the bottleneck. In other languages one would use multithreading to speed things up. Unfortunately, PHP doesn't support multithreading out of the box.
PHP does support non-blocking I/O. This technique is ideal, since we can simulate multiple threads by keeping many sockets open at once and servicing whichever ones are ready with the socket select() call. But reimplementing the HTTP protocol on top of raw sockets is a poor use of time. A more robust technique is to use cURL.
cURL supports non-blocking I/O natively in the driver. The problem is that you have to build all the code around it yourself before it becomes useful.
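To give a feel for the code you would otherwise write by hand, here is a minimal sketch of concurrent fetching with PHP's curl_multi_* functions. The function name fetch_all() and the shape of its return value are assumptions made for illustration; they are not part of FastCrawl.

```php
<?php
// Hypothetical helper, not FastCrawl's API: fetch several URLs concurrently.
function fetch_all(array $urls, int $timeoutSeconds = 10): array
{
    $multi = curl_multi_init();
    $handles = [];

    foreach (array_unique($urls) as $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // keep the body in memory
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, $timeoutSeconds);
        curl_multi_add_handle($multi, $ch);
        $handles[$url] = $ch;
    }

    // Drive all transfers at once; curl_multi_select() waits for socket activity
    // (or the one second timeout) instead of spinning the CPU.
    do {
        $status = curl_multi_exec($multi, $running);
        if ($running > 0) {
            curl_multi_select($multi, 1.0);
        }
    } while ($running > 0 && $status === CURLM_OK);

    // Read the per-transfer result codes (CURLE_OK means success).
    $codes = [];
    while (($info = curl_multi_info_read($multi)) !== false) {
        foreach ($handles as $url => $ch) {
            if ($ch === $info['handle']) {
                $codes[$url] = $info['result'];
            }
        }
    }

    $results = [];
    foreach ($handles as $url => $ch) {
        $ok = ($codes[$url] ?? null) === CURLE_OK;
        $results[$url] = [
            'ok'     => $ok,
            'status' => curl_getinfo($ch, CURLINFO_HTTP_CODE),
            'body'   => $ok ? curl_multi_getcontent($ch) : null,
        ];
        curl_multi_remove_handle($multi, $ch);
    }
    curl_multi_close($multi);

    return $results;
}
```

Calling fetch_all() with a list of URLs returns one entry per URL once the slowest transfer has finished, so total time is roughly that of the slowest request rather than the sum of all of them.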
Enter FastCrawl. It uses cURL for the requests and MySQL as a thread-safe queue. In your driver code, you just load up URLs, pass an optional callback (the default callback populates the raw_data table), then process the resulting data from the table normally. Because it uses database transactions, you can even run multiple PHP instances against the same queue.
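As an illustration of how a database table can act as a queue shared by several PHP processes, here is a minimal sketch. The table url_queue, its columns, and the function claim_next_url() are hypothetical and not FastCrawl's actual schema or API. The sketch also swaps in a different technique from the one described above: instead of an explicit transaction, it relies on a conditional UPDATE, which the database applies atomically, so only one worker can claim a given row. A locking read such as SELECT ... FOR UPDATE inside a transaction is the other common way to get the same guarantee in MySQL.

```php
<?php
// Hypothetical schema, not FastCrawl's own: url_queue(id, url, status).
// Assumes the PDO connection throws exceptions on errors.
function claim_next_url(PDO $db): ?array
{
    // Retry a few times in case another worker wins the race for the same row.
    for ($attempt = 0; $attempt < 5; $attempt++) {
        $row = $db->query(
            "SELECT id, url FROM url_queue WHERE status = 'pending' ORDER BY id LIMIT 1"
        )->fetch(PDO::FETCH_ASSOC);

        if ($row === false) {
            return null; // nothing left to crawl
        }

        // Compare-and-set: the row is only updated if it is still pending,
        // so at most one worker sees rowCount() === 1 for a given id.
        $claim = $db->prepare(
            "UPDATE url_queue SET status = 'claimed' WHERE id = :id AND status = 'pending'"
        );
        $claim->execute([':id' => $row['id']]);

        if ($claim->rowCount() === 1) {
            return $row;
        }
    }

    return null; // gave up after repeated contention
}
```

Each worker process would call claim_next_url() in a loop, fetch the returned URL, and store the result, which is why several instances can run side by side without crawling the same page twice.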
FastCrawl lets you parse the results any way you want, but just how much faster are PHP's built-in functions than simple_html_dom? Check out the interactive benchmark!
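For reference, parsing with built-in functions might look like the sketch below, assuming "built-in" means PHP's DOM extension (DOMDocument and DOMXPath). The function name extract_links() is made up for this example and is not part of FastCrawl.

```php
<?php
// Hypothetical helper: pull every link out of an HTML page with XPath.
function extract_links(string $html): array
{
    if ($html === '') {
        return []; // loadHTML() rejects empty input
    }

    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; buffer libxml warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $links = [];
    // Select only anchors that actually carry an href attribute.
    foreach ($xpath->query('//a[@href]') as $anchor) {
        $links[] = [
            'href' => $anchor->getAttribute('href'),
            'text' => trim($anchor->textContent),
        ];
    }

    return $links;
}
```

The same pattern works for any other XPath expression, such as '//table//tr' or '//div[@class="price"]'.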
FastCrawl has built-in support for CSV files, so your output always comes out in a format that you can use right away.
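If you want to see what CSV output involves under the hood, here is a minimal sketch using PHP's fputcsv(). The function rows_to_csv() is a hypothetical helper, not FastCrawl's CSV API, and it assumes every row is an associative array with the same keys.

```php
<?php
// Hypothetical helper: turn scraped rows into CSV text with a header line.
function rows_to_csv(array $rows): string
{
    if ($rows === []) {
        return '';
    }

    // Write to an in-memory stream so the result can be returned as a string.
    $fh = fopen('php://memory', 'w+');

    // The header comes from the keys of the first row.
    fputcsv($fh, array_keys($rows[0]), ',', '"', '\\');
    foreach ($rows as $row) {
        // fputcsv() quotes fields containing commas, quotes, or whitespace.
        fputcsv($fh, array_values($row), ',', '"', '\\');
    }

    rewind($fh);
    $csv = stream_get_contents($fh);
    fclose($fh);

    return $csv;
}
```

To write straight to disk instead, open a real file with fopen('output.csv', 'w') and pass that handle to fputcsv().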
