
I need to test the response codes of 1 billion (yes, with a "B") pages. I am perfectly aware that no matter how I twist it, it will take many days. I have time.

However, my current script only manages about 200 requests per second. That isn't fast enough: at this rate it will take two months.

mycurl() {
    # Fetch the URL, discard the body, and capture only the HTTP status code.
    response=$(curl --write-out '%{http_code}' --silent --output /dev/null "http://www.example.com/test/$1")
    echo "$1"
    if [ "$response" == "200" ]; then
        echo "valid" > "enum/$1"
    fi
}
export -f mycurl

seq 1000000000 | parallel -j0 mycurl

I have a feeling parallel isn't going as fast as it could (i.e. waiting for something).

I have found this but am unsure about how to make it work: https://www.gnu.org/software/parallel/sem.html
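For context, sem is GNU Parallel's semaphore front end: each invocation queues one job in the background, up to a concurrency limit, and sem --wait blocks until all of them have finished. A minimal sketch of how it is typically driven from a shell loop follows; the -j 100 limit and the short seq range are placeholders, not recommendations, and because every URL forks a fresh sem process, piping seq straight into parallel (as the script above already does) is generally the faster pattern for a job this size.

# Sketch only: sem limits how many mycurl jobs run at once.
for i in $(seq 1 1000); do      # small range purely for illustration
    sem -j 100 mycurl "$i"      # start a job when a slot is free
done
sem --wait                      # block until every queued job has completed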

How can I optimise my script?

1 Answer


Use --head (or -I) to fetch just the headers, rather than the headers plus the contents of the web page. Depending on the size of the web pages, this may reduce network traffic.

You have already specified that the output is to be discarded, so no time is lost writing it to a file. Therefore the only reduction this will give is on the server side and on the network. The server will not actually send the page over the net, but it may still generate it, or retrieve it from cache.
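A minimal sketch of the function with that change, keeping the rest of the original script unchanged (the URL and the enum/ directory are the asker's; with --head, curl issues a HEAD request instead of GET):

mycurl() {
    # --head sends a HEAD request, so only the status line and headers cross the wire.
    response=$(curl --head --write-out '%{http_code}' --silent --output /dev/null "http://www.example.com/test/$1")
    echo "$1"
    if [ "$response" == "200" ]; then
        echo "valid" > "enum/$1"
    fi
}
export -f mycurl

Note that some servers answer HEAD requests differently from GET (for example with 405 or 501), so it is worth spot-checking a few known-good URLs before launching the full run.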

Jos