A web crawler, also called a web spider, is an Internet bot that automatically browses the World Wide Web. Its purpose is usually to build a web index.
// Request a page by URL. The response content is stored
// in the global char array buffer[], and the HTTP status
// code (e.g. 200, 404) is returned.
int requestPage(char url[]);
This function is the most basic tool you will use when writing Major Assignment 1!
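char url[MAX_LEN] = "https://curl.se/libcurl/";
int status;
int attempts = 0;
// Retry until the request succeeds, but give up after 5 attempts
// so that a permanently failing URL cannot loop forever.
do {
    status = requestPage(url);
} while (status != 200 && ++attempts < 5);
// Only print buffer[] if the last request actually succeeded.
if (status == 200) cout << buffer << endl;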
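A retry loop like the one above can also be wrapped in a small helper with a cap on the number of attempts. The sketch below is only an illustration: `requestWithRetry`, `callCount`, the value of `MAX_LEN`, and the stand-in `requestPage` (which fails twice with 503 and then succeeds) are all assumptions made so the example runs without a network, not the course's real implementation.

```cpp
#include <cstdio>

const int MAX_LEN = 256;   // assumed capacity; the real MAX_LEN is not shown
char buffer[MAX_LEN];      // stand-in for the global response buffer
int callCount = 0;         // counts calls to the stand-in requestPage

// Stand-in for requestPage (assumption): fails twice with 503,
// then succeeds and fills buffer[].
int requestPage(char url[]) {
    callCount++;
    if (callCount < 3) return 503;
    std::snprintf(buffer, sizeof(buffer), "content of %s", url);
    return 200;
}

// Hypothetical helper: calls requestPage at most maxAttempts times and
// returns the last status code (0 if maxAttempts is not positive).
int requestWithRetry(char url[], int maxAttempts) {
    int status = 0;
    for (int attempt = 0; attempt < maxAttempts; attempt++) {
        status = requestPage(url);
        if (status == 200) break;   // success, stop retrying
    }
    return status;
}
```

The caller should still check the returned status before reading `buffer[]`, because the helper may give up without a successful response.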
Suppose the URL of page 2 looks like this:
https://books.toscrape.com/catalogue/page-2.html
Then the URL of page n should look like this:
https://books.toscrape.com/catalogue/page-n.html
int page, status = 0;
char url[MAX_LEN];
char tem[] = "https://books.toscrape.com/catalogue/page-%d.html";
// Pages are numbered from 1; stop at the first page that returns 404.
for (page = 1; status != 404; page++) {
    snprintf(url, sizeof(url), tem, page);
    status = requestPage(url);
    // if status == 200, some other logic to do with buffer[]
}
// Parse HTML into the global CDocument object doc.
// Remember to parse the HTML before stripping contents!
void parseHtml(const char html[]);
// Strip content from the global doc using the specified CSS
// selector. The stripped content is stored in the global
// char array stripped[]. If no node matches the selector,
// a value other than 0 is returned.
int stripContent(const char selector[]);