
I have a text file (more than 1 GB in size) containing lines like these:

1083021106e581c71003b987a75f18543cf5858b9fcfc5e04c0dddd79cd18764a865ba86d027de6d1900dc171e4d90a0564abbce99b812b821bd0d7d37aad72ead19c17
10840110dbd43121ef0c51a8ba62193eac247f57f1909e270eeb53d68da60ad61519f19cfb0511ec2431ca54e2fcabf6fa985615ec06def5ba1b753e8ad96d0564aa4c
1084011028375c62fd132d5a4e41ffef2419da345b6595fba8a49b5136de59a884d878fc9789009843c49866a0dc97889242b9fb0b8c112f1423e3b220bc04a2d7dfbdff
10880221005f0e261be654e4c52034d8d05b5c4dc0456b7868763367ab998b7d5886d64fbb24efd14cea668d00bfe8048eb8f096c3306bbb31aaea3e06710fa8c0bb8fca71
108501103461fca7077fc2f0d895048606b828818047a64611ec94443e52cc2d39c968363359de5fc76df48e0bf3676b73b1f8fea5780c2af22c507f83331cc0fbfe6ea9
1085022100a4ce8a09d1f28e78530ce940d6fcbd3c1fe2cb00e7b212b893ce78f8839a11868281179b4f2c812b8318f8d3f9a598b4da750a0ba6054d7e1b743bb67896ee62
1086022100638681ade4b306295815221c5b445ba017943ae59c4c742f0b1442dae4902a56d173a6f859dc6088b6364224ec17c4e2213d9d3c96bd9992b696d7c13b234b50

All lines start with one of the prefixes below:

10830110
1083021
10840110
10840110
1088022100
10850110
1085022100
1086022100

I need to split it into 8 separate files, one per prefix. How can I do this with a sed command?

maa

4 Answers


You could use sed to turn your file of prefixes into a file of sed commands, then use that to process the large file. This will almost certainly be more efficient than a shell loop that runs sed (or grep) once per prefix over the same large file. For example, given

$ cat file2
10830110
1083021
10840110
10840110
1088022100
10850110
1085022100
1086022100

then

$ sed 's:.*:/^&/w&.txt:' file2
/10830110/w10830110.txt
/1083021/w1083021.txt
/10840110/w10840110.txt
/10840110/w10840110.txt
/1088022100/w1088022100.txt
/10850110/w10850110.txt
/1085022100/w1085022100.txt
/1086022100/w1086022100.txt

so that

$ sed 's:.*:/^&/w&.txt:' file2 | sed -n -f - file1

produces

$ head 108*.txt
==> 10830110.txt <==

==> 1083021.txt <==
1083021106e581c71003b987a75f18543cf5858b9fcfc5e04c0dddd79cd18764a865ba86d027de6d1900dc171e4d90a0564abbce99b812b821bd0d7d37aad72ead19c17

==> 10840110.txt <==
10840110dbd43121ef0c51a8ba62193eac247f57f1909e270eeb53d68da60ad61519f19cfb0511ec2431ca54e2fcabf6fa985615ec06def5ba1b753e8ad96d0564aa4c
10840110dbd43121ef0c51a8ba62193eac247f57f1909e270eeb53d68da60ad61519f19cfb0511ec2431ca54e2fcabf6fa985615ec06def5ba1b753e8ad96d0564aa4c
1084011028375c62fd132d5a4e41ffef2419da345b6595fba8a49b5136de59a884d878fc9789009843c49866a0dc97889242b9fb0b8c112f1423e3b220bc04a2d7dfbdff
1084011028375c62fd132d5a4e41ffef2419da345b6595fba8a49b5136de59a884d878fc9789009843c49866a0dc97889242b9fb0b8c112f1423e3b220bc04a2d7dfbdff

==> 10850110.txt <==
108501103461fca7077fc2f0d895048606b828818047a64611ec94443e52cc2d39c968363359de5fc76df48e0bf3676b73b1f8fea5780c2af22c507f83331cc0fbfe6ea9

==> 1085022100.txt <==
1085022100a4ce8a09d1f28e78530ce940d6fcbd3c1fe2cb00e7b212b893ce78f8839a11868281179b4f2c812b8318f8d3f9a598b4da750a0ba6054d7e1b743bb67896ee62

==> 1086022100.txt <==
1086022100638681ade4b306295815221c5b445ba017943ae59c4c742f0b1442dae4902a56d173a6f859dc6088b6364224ec17c4e2213d9d3c96bd9992b696d7c13b234b50

==> 1088022100.txt <==
10880221005f0e261be654e4c52034d8d05b5c4dc0456b7868763367ab998b7d5886d64fbb24efd14cea668d00bfe8048eb8f096c3306bbb31aaea3e06710fa8c0bb8fca71

You may want to de-duplicate the pattern file first. You can also sort it in descending numeric order and make each generated command branch to the end of the script after its write, so that each line goes only to the file for its longest matching prefix:

$ sort -nru file2 | sed 's:.*:/^&/{w&.txt\nb\n}:' | sed -n -f - file1

giving

$ head 108*.txt
==> 10830110.txt <==

==> 1083021.txt <==
1083021106e581c71003b987a75f18543cf5858b9fcfc5e04c0dddd79cd18764a865ba86d027de6d1900dc171e4d90a0564abbce99b812b821bd0d7d37aad72ead19c17

==> 10840110.txt <==
10840110dbd43121ef0c51a8ba62193eac247f57f1909e270eeb53d68da60ad61519f19cfb0511ec2431ca54e2fcabf6fa985615ec06def5ba1b753e8ad96d0564aa4c
1084011028375c62fd132d5a4e41ffef2419da345b6595fba8a49b5136de59a884d878fc9789009843c49866a0dc97889242b9fb0b8c112f1423e3b220bc04a2d7dfbdff

==> 10850110.txt <==
108501103461fca7077fc2f0d895048606b828818047a64611ec94443e52cc2d39c968363359de5fc76df48e0bf3676b73b1f8fea5780c2af22c507f83331cc0fbfe6ea9

==> 1085022100.txt <==
1085022100a4ce8a09d1f28e78530ce940d6fcbd3c1fe2cb00e7b212b893ce78f8839a11868281179b4f2c812b8318f8d3f9a598b4da750a0ba6054d7e1b743bb67896ee62

==> 1086022100.txt <==
1086022100638681ade4b306295815221c5b445ba017943ae59c4c742f0b1442dae4902a56d173a6f859dc6088b6364224ec17c4e2213d9d3c96bd9992b696d7c13b234b50

==> 1088022100.txt <==
10880221005f0e261be654e4c52034d8d05b5c4dc0456b7868763367ab998b7d5886d64fbb24efd14cea668d00fbe8048eb8f096c3306bbb31aaea3e06710fa8c0bb8fca71
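The longest-prefix behaviour can be sanity-checked on a tiny sample. This is a minimal sketch assuming GNU sed (`\n` in the `s:::` replacement and `-f -` are GNU extensions); the prefixes `1083` and `10830110`, where one is a prefix of the other, are made up for the demo:

```shell
cd "$(mktemp -d)"                              # scratch directory
printf '%s\n' 1083 10830110 > pat.txt          # 1083 is a prefix of 10830110
printf '%s\n' 10830110abc 1083xyz > data.txt
# Longest (numerically largest) prefix first, branch after the first match:
sort -nru pat.txt | sed 's:.*:/^&/{w&.txt\nb\n}:' | sed -n -f - data.txt
cat 10830110.txt    # 10830110abc
cat 1083.txt        # 1083xyz
```

Because `sort -nru` puts the numerically larger (hence longer) prefix first and each generated block ends in `b`, `10830110abc` is written only to `10830110.txt`, never to `1083.txt`.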

steeldriver

prefix.text (contains the 8 prefixes)

1prefix
2prefix
3prefix
4prefix
x1prefix
x2prefix
x3prefix
x4prefix

input.text (like your 1 GB text file)

1prefix90956666
3prefix26588388
1prefix49080634
x3prefix59162307
x1prefix86437679
x4prefix77832956
x3prefix56458412
2prefix37484977
x2prefix73879936
x1prefix44005273
2prefix57156422
x1prefix67751608
4prefix25566629
x2prefix93657051
x3prefix40897616
4prefix93222501
3prefix35680804
x4prefix42979833
x2prefix08229240
1prefix42071365
4prefix67857600
2prefix66384962
x4prefix21482824
3prefix59616880

Loop with grep, to write one output file per prefix:

while read prefix
do
    grep "^${prefix}" input.text > "output_${prefix}.text"
done < prefix.text

output_x1prefix.text (output example)

x1prefix86437679
x1prefix44005273
x1prefix67751608
Sheldon

This will create a new file for each matched pattern with an extension of .splt in the current working directory and write all matching lines to it:

sed in a shell for loop:

for i in "10830110" "1083021" "10840110" "1088022100" "10850110" "1085022100" "1086022100" # Patterns to match
do
    sed -n "/^$i/p" FileName > "$i.splt" # Change "FileName" to your file name
done

You can do the same, as well, with awk in a shell for loop:

for i in "10830110" "1083021" "10840110" "1088022100" "10850110" "1085022100" "1086022100" # Patterns to match
do
    awk -v pat="^$i" -v pat2="$i" '$0 ~ pat { print $0 > (pat2 ".splt") }' FileName # Change "FileName" to your file name
done

awk with an array of patterns:

awk '{pat["0"] = "10830110";
    pat["1"] = "1083021";
    pat["2"] = "10840110";
    pat["3"] = "1088022100";
    pat["4"] = "10850110";
    pat["5"] = "1085022100";
    pat["6"] = "1086022100";} {for (i in pat) { if ($0 ~ "^"pat[i]) print $0 > (pat[i] ".splt") }}' YourFile

Or save the patterns (one per line) in a pat.txt file and let awk build the array of patterns, like so:

awk 'FILENAME=="pat.txt" { pat[FNR]=$0; next } { for (i in pat) { if ($0 ~ "^"pat[i]) print $0 > (pat[i] ".splt") } }' pat.txt YourFile

Speed test (for science)

I tested the solutions in my answer, as well as those in the answers by @steeldriver and @Sheldon, on the same mid-range PC (three runs each, rounded average). All tests use the same patterns; pat.txt contains:

$ cat pat.txt 
10830110
1083021
10840110
1088022100
10850110
1085022100
1086022100

and the data file file.dat is 1.1 GB, containing 8,484,000 lines, made by duplicating the lines in the example provided by the OP, i.e.:

1083021106e581c71003b987a75f18543cf5858b9fcfc5e04c0dddd79cd18764a865ba86d027de6d1900dc171e4d90a0564abbce99b812b821bd0d7d37aad72ead19c17
10840110dbd43121ef0c51a8ba62193eac247f57f1909e270eeb53d68da60ad61519f19cfb0511ec2431ca54e2fcabf6fa985615ec06def5ba1b753e8ad96d0564aa4c
1084011028375c62fd132d5a4e41ffef2419da345b6595fba8a49b5136de59a884d878fc9789009843c49866a0dc97889242b9fb0b8c112f1423e3b220bc04a2d7dfbdff
10880221005f0e261be654e4c52034d8d05b5c4dc0456b7868763367ab998b7d5886d64fbb24efd14cea668d00bfe8048eb8f096c3306bbb31aaea3e06710fa8c0bb8fca71
108501103461fca7077fc2f0d895048606b828818047a64611ec94443e52cc2d39c968363359de5fc76df48e0bf3676b73b1f8fea5780c2af22c507f83331cc0fbfe6ea9
1085022100a4ce8a09d1f28e78530ce940d6fcbd3c1fe2cb00e7b212b893ce78f8839a11868281179b4f2c812b8318f8d3f9a598b4da750a0ba6054d7e1b743bb67896ee62
1086022100638681ade4b306295815221c5b445ba017943ae59c4c742f0b1442dae4902a56d173a6f859dc6088b6364224ec17c4e2213d9d3c96bd9992b696d7c13b234b50

The results are ordered fastest first, and the code I used for timing is shown under each result:

#1 grep in a shell loop @Sheldon (18 seconds)

s=$(date +%s); while read prefix
do
    grep "^${prefix}" file.dat > ${prefix}.splt
done < pat.txt; e=$(date +%s); echo $(($e-$s))

More accurate timing:

$ time (while read prefix
do
    grep "^${prefix}" file.dat > ${prefix}.splt
done < pat.txt)

real	0m17.969s
user	0m4.437s
sys	0m2.176s

#2 sed @steeldriver (20 seconds)

s=$(date +%s); sed 's:.*:/&/w&.splt:' pat.txt | sed -n -f - file.dat; e=$(date +%s); echo $(($e-$s))

More accurate timing with ^ added in response to the comment by @terdon:

$ time (sed 's:.*:/^&/w&.splt:' pat.txt | sed -n -f - file.dat)

real	0m18.748s
user	0m10.408s
sys	0m1.546s

#3 sed in a shell loop @Raffa (21 seconds)

s=$(date +%s); for i in "10830110" "1083021" "10840110" "1088022100" "10850110" "1085022100" "1086022100" # Patterns to match
    do
    sed -n "/^$i/p" file.dat > "$i.splt"
    done; e=$(date +%s); echo $(($e-$s))

#4 awk in a shell loop @Raffa (35 seconds)

s=$(date +%s); for i in "10830110" "1083021" "10840110" "1088022100" "10850110" "1085022100" "1086022100" # Patterns to match
    do
    awk -v pat="^$i" -v pat2="$i" '$0 ~ pat { print $0 > pat2".splt"}' file.dat
    done; e=$(date +%s); echo $(($e-$s))

#5 awk @Raffa (414 seconds) <-- That was a shock

s=$(date +%s); awk '{pat["0"] = "10830110";
    pat["1"] = "1083021";
    pat["2"] = "10840110";
    pat["3"] = "1088022100";
    pat["4"] = "10850110";
    pat["5"] = "1085022100";
    pat["6"] = "1086022100";} {for (i in pat) { if ($0 ~ "^"pat[i]) print $0 > pat[i]".splt"}}' file.dat; e=$(date +%s); echo $(($e-$s))
Raffa

If the file is already split into lines which start with these strings, as in the example, you can use awk like this (reference):

awk '{file="file."(++i)".txt"}{print > file;}' input-file.txt

This will produce a new file for each line.

If we suppose the starting strings have a fixed length of 7 characters (which is not the case in the example), we can split the input file into a separate file for each starting string with something like (reference):

awk '{file="file."(substr($1,1,7))".txt"}{print >> file;}' input-file.txt
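If the starting strings do not have a fixed length, one alternative (a sketch in the spirit of the pat.txt approach in the answer above; the file names here are made up for the demo) is to read the prefixes from a file and test each one with `index()`:

```shell
cd "$(mktemp -d)"                                    # scratch dir for a small demo
printf '%s\n' 1083021 1088022100 > pat.txt           # variable-length prefixes
printf '%s\n' 1083021aaa 1088022100bbb > input-file.txt
# NR==FNR holds only while the first file (pat.txt) is being read
awk 'NR==FNR { pref[FNR] = $0; next }
     { for (i in pref) if (index($0, pref[i]) == 1) print > ("file." pref[i] ".txt") }' pat.txt input-file.txt
```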
pa4080