I have a file containing 2.3M lines, which looks like this:

$ less V2.fastq

>TS19_EWP4IQK02JPFP5
CATGCTGCCTCCCGTAGGAGTTTGGTCCGTGTCTCAGTACCAATGTGGGGGACCTTCCTC
TCAGAACCCTATCCATCGTCGGTTTGGTGGGCCGTTACCCGCCAACTGCCTAATGGAACG
CATGCCTATCTATCAGCGATGAATCTTTAGCAAATATCCCCATGCGGGGCCCTGCTTCAT
GCGGTATTAGTCCGACTTTCGCCGGGTTATCCCCTCTGATAGGTAAGTTGCATACGCGTT
ACTCACCGTGCGCCGG
>TS20_EWP4IQK02FSQQL
CATGCTGCCTCCCGTAGGAGTTTGGACCGGTGTCTCAGTTCCAACTGTGGGGGGACCTTC
CTCTCCAGAACCCCCTATCCCATCGAAG
>TS19_EWP4IQK02GBB8K
CATGCTGCCTCCCGTAGGAGTCTGGGCCGTGTCTCAGTCCCAGTGTGGCCGATCACCCTC
TCAGGTCGGCTATGTATCGTCGCCTAGGTGAGCCGTTACCTCACCTACTAGCTAATACAA
CGCAGGTCCATCTTGTAGTGGAGCATTTGCCCCTTTCAAATAAATGACATGAGTCACCCA
TTGTTATGCGGTATTAGCTATCGTTTCCAATAGTTATCCCCCGCTACAAGGCAGGTTACC
TACGCG
>TS19_EWP4IQK02FUJRM
CATGCTGCCTCCCGTAGGAGTTTGGACCGTGTCTCAGTTCCAATGTGGGGGACCTTCCTC
TCAGAACCCCTATCCATCGAAGACTAGGTGGGCCGTTACCCCGCCTACTATCTAATGGAA
CGCACCCCCATCTTACACCGGTAAACCTTTAATCATGCGAAAATGCTTACTCATGATAAC
ATCTTGTATTAATCTCCCTTTCAGAAGGCTGTCCAAGAGTGTAAGGCAGGTTGGATACGC
GTTACTCACCCGTGCGCCGGTCG
>TS119_EWP4IQK02I2KHZ
CATGCTGCCTCCCGTAGGAGTTTGGACCGTGTCTCAGTTCCAATGTGGGGGACCTTCCTC
TCAGAACCCCTATCCATCGATGGCTTGGTGGGCCGTTACCCCGCCAACAACCTAATGGAA
CGCATCCCCATCAATGACCGAAATTCTTTAATAGCTGAAAGATGCCTTTCAGATATACCA
TCGGGTATTAATCTTTCTTTCGAAAGGCTATCCCCGAGTCATCGGCAGGTTGGATACGTG
TTACTCACCCGTGCGCCGTCG

A line that starts with ">" denotes a single SampleID. The sample name is the part of that line before the "_", for example TS19, TS20, TS119, etc. I want to create a separate output file for each sample, containing that sample's SampleID lines and the sequences that follow them. Can anyone please help me?

Many thanks

edit:1 To get the output for sample TS19 we can use the following command, which returns the output shown below. Command:

sed -n '/>TS19_/, />/p' V2.fasta 

Output (a few lines out of thousands)

>TS19_ok4.40713 CTAACGCAGTCA
TTGGGCCGTGTCTCAGTCCCAATGTGGCCGGTCACCCTCTCAGGTCGGCTACTGATCGTCGGCTTGGTAGGCCGTTACCCCACCAACTACCTAATCAGACGCGGGTCCATCTCATACCACCGGAGCTTTTTCACACCGTACCATGCGGTACTGTGCGCTTATGCGGTATTAGCAGTCGTTTCCAACTGTTATCCCCTGTATGAGGCAGGTTACCCACGCGTTACTCACCCGTCCG
>TS6.2_ok4.40714 CGTCAGACGGAT
>TS19_ok4.40771 CTAACGCAGTCA
TTGGGCCGTGTCTCAGTCCCAATGTGGCCGGTCACCCTCTCAGGTCGGCTACTGATCGTCGCTTTGGTAGGCCGATACCCCACCAACCGGCTAATCAGACGCGGGTCCATCTCATACCACCGGAGTTTTTACCCCTCGCACCATGCGGTGCTGTGGTCTTATGCGGTATTAGCAGTCATTTCTTGACTGTTTATTTCCCCTCGTATGAGGCAGGTTACCCACGCGTTACTCACCCG
>TS8_ok4.40772 TCGAGACGCTTA
>TS19_ok4.40971 CTAACGCAGTCA
CTGGGCCGTGTCTCAGTCCCAATGTGGCCGGTCACCCTCTCAGGTCGGCTACTGATCATCGCCTTGGTGGGCCGTTACCCCGCCAACAAGCTAATCAGACGCGGGTCCATCTCATACCACCGGAGTTTTTCACACTGTACCATGTGGTACTGTGCGCTTATGCGGTATTACCAGCCGTTTCCAGCTGCTATCCCCATCTGAAGGGCAGGTTGCTTACGCGG
>TS127_ok4.40972 GACCGAGCTATG

I just want to remove the lines that start with > but do not belong to TS19. Can anyone help me?

edit:2 https://drive.google.com/file/d/17MC0tiIE6axOJqNZukzsQX5bVpuvV312/view?usp=sharing

DEEP

3 Answers


Edit 1 Take away the -n 7 ... you won't need it.

csplit -z V2.fastq -f TestSample /\>TS/ '{*}'

This will generate the files TestSample00, TestSample01, TestSample02, TestSample03, ... TestSamplennnnnn from your file, one per record.

Finally, you'll want a prefix to identify all these files. Sorry, my solution doesn't rename the files to follow the sample naming convention (TS19, TS20, ...), but at least you can vary the prefix each time you run the command by changing the -f parameter.

Edit 2
If, however, you need all the data with the same sample identifier collected in one file, follow up with a command such as

find . -name "TestSample*" | xargs grep -l TS19_ | awk '{print "cat " $1"  >> My_TS19_.fasta " }' | sh

The new file (My_TS19_.fasta) will contain all the sequences that pertain to TS19_, or to whatever case-sensitive string you put after grep.

I've added the xargs command to stream the list of files rather than choking the find command.

The awk command takes the file names and appends each one to an initially non-existent or empty file. Be careful to use a new file each time to avoid making duplicates.
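
If the pieces should end up in files named after their sample, rather than being matched one sample at a time, the two steps can be combined in a loop. This is only a minimal sketch, assuming GNU csplit; the demo records, the piece_ prefix and the <sample>.fasta output names are made up for illustration:

```shell
# Scratch directory with a tiny made-up input (three records, two samples).
work=$(mktemp -d)
printf '>TS19_a\nACGT\n>TS20_b\nGGCC\n>TS19_c\nTTAA\n' > "$work/demo.fasta"

# One piece per record; -s silences the byte counts, -z drops empty pieces.
csplit -s -z -f "$work/piece_" "$work/demo.fasta" '/^>/' '{*}'

# Append each piece to <sample>.fasta, where <sample> is the header text
# between ">" and the first "_", then delete the piece.
for piece in "$work"/piece_*; do
    sample=$(head -n 1 "$piece" | sed 's/^>//; s/_.*//')
    cat "$piece" >> "$work/$sample.fasta"
    rm -f "$piece"
done

ls "$work"
```

On a file with millions of lines this starts several processes per record, so a single awk pass will probably be faster.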

mondotofu

I wrote a perl script a while ago, specifically for this.

The script takes a fasta file and creates individual files for all sequences. It will also clean the fasta file: line breaks in the sequence as well as empty lines and leading whitespace in headers (> id) are removed by default. Additionally, non-ACGT characters can be converted to N and lowercase sequence characters can be converted to uppercase.

The script is called split_fasta.pl and you can find it on my GitHub: https://github.com/nterhoeven/sequence_processing
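
To give an idea of what such a cleanup involves, here is a rough awk sketch. It is not split_fasta.pl and does not split anything; it only illustrates the cleaning steps described above, applying all of them unconditionally, on a made-up input with hypothetical file names:

```shell
# Scratch directory with a small, deliberately messy made-up FASTA file.
work=$(mktemp -d)
printf '>  seq1\nacgt\n\nNNxa\n>seq2\nGGCC\n' > "$work/raw.fasta"

awk '
/^>/ {                            # header line
    if (seq != "") print seq      # flush the previous sequence
    seq = ""
    sub(/^>[ \t]+/, ">")          # drop whitespace right after ">"
    print
    next
}
/^[ \t]*$/ { next }               # skip empty lines
{
    line = toupper($0)            # lowercase bases to uppercase
    gsub(/[ \t\r]/, "", line)     # remove stray whitespace
    gsub(/[^ACGT]/, "N", line)    # anything that is not A, C, G or T becomes N
    seq = seq line                # join wrapped sequence lines
}
END { if (seq != "") print seq }  # flush the last sequence
' "$work/raw.fasta" > "$work/clean.fasta"

cat "$work/clean.fasta"
```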

Wayne_Yux

With awk, you can set > as the record separator and process (match) whole records instead of lines, and search for e.g. records containing "TS19" like so:

awk 'BEGIN {RS=">"; ORS=""} /TS19/ {print ">" $0}' V2.fasta

Or automatically split the records by sample into files with a .split extension, i.e. TS119.split, TS19.split, TS20.split, ... in the current working directory, like so:

awk 'BEGIN {RS=">"; ORS=""} NR>1 {split($1, arr, "_"); f=arr[1]".split"; print ">" $0 > f}' V2.fasta
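
For comparison, the same split can also be done line by line, without changing the record separator. The following is only a sketch on made-up data: each header line switches the output file, and the previous file is closed first so that an awk without gawk's automatic file handling is less likely to run out of open files when there are many samples. The demo records and the scratch directory are assumptions for illustration:

```shell
# Scratch directory with a tiny made-up input (three records, two samples).
work=$(mktemp -d)
printf '>TS19_a\nACGT\nTTTT\n>TS119_b\nGGCC\n>TS19_c\nTTAA\n' > "$work/demo.fasta"

awk -v dir="$work" '
/^>/ {
    if (out != "") close(out)                 # finished with the previous file
    split(substr($0, 2), parts, "_")          # header text before the first "_"
    out = dir "/" parts[1] ".split"
}
out != "" { print >> out }                    # append header and sequence lines
' "$work/demo.fasta"

ls "$work"
```

Because the files are opened in append mode, running this twice in the same directory duplicates the records, so start from an empty output directory.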
Raffa