I have scheduled some crontab jobs to scrape a number of websites.

The cron jobs run the scrapers shortly after 1 AM: scraper_1 starts at 1:01, scraper_2 at 1:03, and scraper_3 at 1:05.

Each scraper may take 3 to 6 minutes to complete, so the runs overlap in time.

# start at 1:01
01 01 * * * cd /home/ubuntu/jobscrapers/scraper_1 && scrapy crawl spider_1 >> /tmp/scraper.log 2>&1

# start at 1:03
03 01 * * * cd /home/ubuntu/jobscrapers/scraper_2 && scrapy crawl spider_2 >> /tmp/scraper.log 2>&1

# start at 1:05
05 01 * * * cd /home/ubuntu/jobscrapers/scraper_3 && scrapy crawl spider_3 >> /tmp/scraper.log 2>&1

All of these scrapers are written using Scrapy and they use Selenium and Chrome Web Driver.

The code runs fine on my development machine (Windows), but recently I have been getting occasional errors on the production machine (Ubuntu).

For example, a scraper runs fine for some time and then crashes with the following error:

selenium.common.exceptions.WebDriverException: Message: unknown error: session deleted because of page crash from unknown error: cannot determine loading status from tab crashed (Session info: headless chrome=86.0.4240.111) (Driver info: chromedriver=2.41.578700 (2f1ed5f9343c13f73144538f15c00b370eda6706),platform=Linux 5.4.0-1029-aws x86_64)

Is this because two scrapers are running at the same time? Does crontab create a new thread for each scraper (WebDriver)?

Updated question

The issue was that there was no space left on the server...

I found the problem by accident; the Scrapy log was not helpful. Are there other logs I should have checked that would have pointed me to the actual issue?

Hooman Bahreini

1 Answer

The issue was that there was no space left on my server.

I used the df -h command to check the available space and noticed that the / partition was 100% full:

$ df -h
Filesystem      Size  Used Avail Use% Mounted on
devtmpfs        460M     0  475M   0% /dev
tmpfs           478M     0  492M   0% /dev/shm
tmpfs           478M  432K  492M   1% /run
tmpfs           478M     0  492M   0% /sys/fs/cgroup
/dev/nvme0n1p1  8.0G  8.0G  664K 100% /
tmpfs            96M     0   99M   0% /run/user/1000
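
Since the Scrapy log did not point to the full disk, one option is to have the scraper check free space itself before starting Chrome. The following is only a minimal sketch using Python's standard `shutil.disk_usage`; the function names `used_percent` and `has_free_space` and the 500 MB threshold are hypothetical and not part of the original setup.

```python
import shutil


def used_percent(path="/"):
    """Return the percentage of used space on the filesystem containing path."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        # Some pseudo-filesystems report a total size of zero.
        return 0.0
    return 100.0 * usage.used / usage.total


def has_free_space(path="/", min_free_bytes=500 * 1024 * 1024):
    """Return True if at least min_free_bytes are free on path's filesystem.

    The 500 MB default is an arbitrary assumption; tune it for your server.
    """
    return shutil.disk_usage(path).free >= min_free_bytes


if __name__ == "__main__":
    # Check both the root partition and /tmp, where the cron log is written.
    for mount in ("/", "/tmp"):
        status = "ok" if has_free_space(mount) else "LOW ON SPACE"
        print(f"{mount}: {used_percent(mount):.1f}% used ({status})")
```

Calling `has_free_space("/")` at the start of a spider and logging a clear message when it returns `False` would likely have made the cause visible in /tmp/scraper.log, instead of only the generic "tab crashed" error.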

As my server is an AWS EC2 instance, I had to extend the volume. The following two links explain how to extend an EC2 instance's volume:

  1. How to extend an EC2 volume
  2. How to extend volume if you receive an error that there's no space
Hooman Bahreini