
How can I make grep searches over a large number of files run faster? My first attempt uses parallel (improvements to it, or other approaches, are welcome).

The first grep simply gives the list of files, which are then passed to parallel, which runs grep again to output matches.

The parallel command is supposed to wait for each grep to finish so that I get the results from each file together; otherwise I get a mix-up of the results from different files.

I also use sed to skip files if necessary through the command

sed -z "${ista}~${istp}!d"
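
For illustration only (these ista/istp values are made up), the GNU sed address first~step keeps every istp-th NUL-separated filename starting at number ista:

ista=2 istp=3
printf '%s\0' file{1..9} | sed -z "${ista}~${istp}!d" | tr '\0' '\n'
# prints file2, file5 and file8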

Multiple patterns are stored in array ${ptrn[@]} whilst the trailing context after matching lines is defined in ${isufx[@]}.

ptrn=("-e" "FN" "-e" "DS")
ictx=(-A 8)

grep --null -r -l "${isufx[@]}" \
  -f <(printf "%s\n" "${ptrn[@]}") -- "${fdir[@]}" \
  | sed -z "${ista}~${istp}!d" \
  | PARALLEL_SHELL=bash psgc=$sgc psgr=$sgr uptrn=$ptrn \
    parallel -m0kj"$procs" \
    'for fl in {}; do
       printf "\n%s\n\n" "${psgc}==> $fl <==${psgr}"
       grep -ni '"${ictx[@]@Q}"' -f <(printf "%s\n" "${ptrn[@]}") -- "$fl"
     done'

Raffa
Fatipati

2 Answers


grep is one of the most refined and time-proven tools performance-wise ... Please see, for example, the speed comparison of grep with other text-processing tools on very large 1G+ files with 8M+ lines here: https://askubuntu.com/a/1420653 ... Also, proper text-processing (i.e. preserving each file's output separately and in the correct line order) is not, IMHO, a suitable task for parallel because, as you noticed, it will mix the results from different files and shift their line order ... You did use parallel's -k option to keep the output order the same as the input order, but that will only work as intended if:

  1. You limit the parallel jobs to 1, i.e. -j 1 (also spelled --max-procs 1 or -P 1).
  2. You make sure the text is passed in the right order, e.g. by piping the actual text (in the right order/sequence) to parallel and using its --pipe option to pipe that text on to grep afterwards, as sketched below.
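
A minimal sketch of that constrained setup (the file names here are hypothetical):

# Feed the text in the desired order; -j 1 runs one job at a time, -k keeps the
# output in input order, and --pipe hands each block of stdin to a grep instance.
cat file1 file2 file3 | parallel -j 1 -k --pipe grep -e 'FN' -e 'DS'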

That, however, will defeat your intended purpose of running multiple jobs in parallel, and therefore the added speed gain (if any) would be negligible.

Also, using a for loop requires grep to run fully for each argument/file in the loop's head, with virtually the same match pattern(s) for every file ... So it might not be the best approach when you are trying to speed things up ... You might be better off using e.g. grep's --recursive option in that case.
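
For example, a single recursive call (the directory name and options here are only illustrative, modelled on the arrays in your question) starts one grep process for the whole tree:

# One grep walks the directory tree recursively instead of one invocation per file.
grep -rni -A 8 -e 'FN' -e 'DS' -- ./somedir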

However, you can run multiple jobs in the background by sending each grep call inside your for loop to the background and redirecting its output to a separate file, i.e. grep ... > file1 &, then later joining the resulting output files into one file if you want ... That would run multiple instances of grep in the background and greatly speed up the loop ... Please see the demonstration below.
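
Applied to grep, that could look roughly like this sketch (the files array and the output file names are assumptions, not your exact command):

files=(file1 file2 file3)
for fl in "${files[@]}"; do
  # One background grep per file, each writing to its own output file,
  # so results from different files never interleave.
  grep -ni -A 8 -e 'FN' -e 'DS' -- "$fl" > "${fl}.matches" &
done
wait                # wait for all background greps to finish
cat ./*.matches     # then join the per-file results if you want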

For demonstration purposes I will use (sleep N; echo "something" > fileN) & in place of grep ... > file1 & ... The sub-shell syntax (...; ...) is necessary when you send multiple nested commands to the background, but it is not needed for a single command:

$ # Creating some background jobs/processes
i=0
for f in file1 file2 file3
  do
  # Start incrementing a counter to use in filenames and calculating sleep seconds.
  ((i++))
  # Send command/s to background
  (sleep $((10*i)); echo "$f $(date)" > "${f}_${i}") &
  # Add background PID to array
  pids+=( "$!" )
  done

Output:

[1] 31335
[2] 31336
[3] 31338

$ # Monitoring and controlling the background jobs/processes
while sleep 5; do
  echo "Background PIDs are: ${pids[@]}"
  for index in "${!pids[@]}"
    do
    if kill -0 "${pids[index]}" &> /dev/null; then
      echo "${pids[index]} is running"
      # Do whatever you want here if the process is running ... e.g. kill "${pids[index]}" to kill that process.
    else
      echo "${pids[index]} is not running"
      unset 'pids[index]'
      # Do whatever you want here if the process is not running.
    fi
    done
  if [[ "${#pids[@]}" -eq 0 ]]
    then
    echo "Combined output files contents:"
    cat file*
    unset i
    unset pids
    break
  fi
  done

Output:

Background PIDs are: 31335 31336 31338
31335 is running
31336 is running
31338 is running
[1]   Done                    ( sleep $((10*i)); echo "$f $(date)" > "${f}_${i}" )
Background PIDs are: 31335 31336 31338
31335 is not running
31336 is running
31338 is running
Background PIDs are: 31336 31338
31336 is running
31338 is running
[2]-  Done                    ( sleep $((10*i)); echo "$f $(date)" > "${f}_${i}" )
Background PIDs are: 31336 31338
31336 is not running
31338 is running
Background PIDs are: 31338
31338 is running
[3]+  Done                    ( sleep $((10*i)); echo "$f $(date)" > "${f}_${i}" )
Background PIDs are: 31338
31338 is not running
Combined output files contents:
file1 Fri Mar 31 12:20:47 AM +03 2023
file2 Fri Mar 31 12:20:57 AM +03 2023
file3 Fri Mar 31 12:21:07 AM +03 2023

Please also see Bash Job Control.

Raffa

This is one of GNU Parallel's examples:

https://www.gnu.org/software/parallel/parallel_examples.html#example-parallel-grep
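
From memory, the example on that page is along these lines (check the link for the canonical version):

# Find the files, then let parallel fan the file names out to grep,
# keeping the output order (-k) and batching up to 1000 names per grep call.
find . -type f | parallel -k -j150% -n 1000 -m grep -H -n STRING {}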

If you are grepping the same files again and again, maybe this is usable too: https://stackoverflow.com/a/11913999/363028

Ole Tange