Merging/joining a lot of csv files with numeric digits in the file name

Question

When we process our CSV data, we generate many output files of 30,000 lines each. They all have the same columns/fields, are all in CSV format, and are all written to the same folder on a Linux server. Each file is uniquely named using a combination of date, time and a sequence number. See below.

AB_20151127_120000_0_SEGMENT_FINAL.csv
AB_20151127_120000_1_SEGMENT_FINAL.csv
AB_20151127_120000_2_SEGMENT_FINAL.csv
AB_20151127_120000_3_SEGMENT_FINAL.csv
.
.
.
AB_20151127_120000_599_SEGMENT_FINAL.csv

We now need to merge all of them into one big file called AB_20151127_120000_SEGMENT_FINAL.csv (note that the merged file name has no sequence number).

I tried awk as below but it is not working. Please tell me what I did wrong.

awk '"AB_20151127_120000_" NR-1 "_SEGMENT_FINAL.csv"' > AB_20151127_120000_SEGMENT_FINAL.csv

score 3 · Accepted Answer · edited May 07 '16 at 09:06

If the order in which the files are concatenated is not important, use:

cat AB_20151127_120000_*_SEGMENT_FINAL.csv > AB_20151127_120000_SEGMENT_FINAL.csv

If the order is important, you'll have to get creative. If you know the number of segments, 599 for example, you can use brace expansion (the \ is only there to let me print the command on two lines for readability):

cat AB_20151127_120000_{0..599}_SEGMENT_FINAL.csv > \
    AB_20151127_120000_SEGMENT_FINAL.csv

If you don't, you can still use brace expansion. Choose a number large enough to cover every file (9999 here) and discard the error messages about nonexistent files:

cat AB_20151127_120000_{0..9999}_SEGMENT_FINAL.csv > \
    AB_20151127_120000_SEGMENT_FINAL.csv 2>/dev/null
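If guessing an upper bound feels fragile, a loop can count upward and stop at the first segment that is missing. The following is only a sketch: merge_segments is a hypothetical helper name, and it assumes the segment numbers start at 0 and have no gaps.

```shell
# merge_segments PREFIX SUFFIX   (hypothetical helper, not a standard command)
# Appends PREFIX_0_SUFFIX, PREFIX_1_SUFFIX, ... to PREFIX_SUFFIX in numeric
# order and stops at the first segment number that does not exist.
merge_segments() {
    prefix=$1
    suffix=$2
    out="${prefix}_${suffix}"
    : > "$out"                  # create or truncate the merged file
    i=0
    while [ -f "${prefix}_${i}_${suffix}" ]; do
        cat "${prefix}_${i}_${suffix}" >> "$out"
        i=$((i + 1))
    done
    echo "merged $i segments into $out"
}

# Usage, run from the folder that holds the segments:
# merge_segments AB_20151127_120000 SEGMENT_FINAL.csv
```

Unlike the brace expansion, this never asks cat to open a file that is not there, so there are no error messages to suppress. If a segment number is missing in the middle, everything after the gap is silently left out.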

Alternatively, you can generate a list of sorted file names and use that:

cat $(printf '%s\n' AB_20151127_120000_*_SEGMENT_FINAL.csv | sort -nt_ -k4) > \
    AB_20151127_120000_SEGMENT_FINAL.csv

The printf prints each file name followed by a newline. That list is then passed to sort, which sorts it numerically (-n) on the 4th field (-k4), where fields are separated by _ (-t_).
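To see what the sort stage does on its own, this small sketch feeds four sample names, built from the naming scheme in the question, through the same sort options. The variable name sorted is only illustrative.

```shell
# Sample names in deliberately scrambled order.
sorted=$(printf '%s\n' \
    AB_20151127_120000_10_SEGMENT_FINAL.csv \
    AB_20151127_120000_2_SEGMENT_FINAL.csv \
    AB_20151127_120000_0_SEGMENT_FINAL.csv \
    AB_20151127_120000_1_SEGMENT_FINAL.csv |
    sort -n -t_ -k4)            # numeric sort on the 4th _-separated field
printf '%s\n' "$sorted"
# AB_20151127_120000_0_SEGMENT_FINAL.csv
# AB_20151127_120000_1_SEGMENT_FINAL.csv
# AB_20151127_120000_2_SEGMENT_FINAL.csv
# AB_20151127_120000_10_SEGMENT_FINAL.csv
```

Because the command substitution in the cat command is unquoted, this approach relies on the file names containing no spaces or glob characters, which holds for the names shown in the question.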

kos · Answer 2 · 2016-05-07T09:08:54.340

If you have access to a Zsh shell, the task can be reduced to a single command:

cat AB_20151127_120000_*_SEGMENT_FINAL.csv(n) >AB_20151127_120000_SEGMENT_FINAL.csv

This works because the (n) glob qualifier, which has to come at the end of the pattern, makes the pattern expand to a list of file names sorted in their natural (numeric) order, as opposed to their lexicographical order.

For comparison, filename expansion in Bash:

$ for f in *; do echo "$f"; done
AB_20151127_120000_0_SEGMENT_FINAL.csv
AB_20151127_120000_10_SEGMENT_FINAL.csv
AB_20151127_120000_1_SEGMENT_FINAL.csv
AB_20151127_120000_2_SEGMENT_FINAL.csv
AB_20151127_120000_3_SEGMENT_FINAL.csv

Filename expansion in Zsh using the (n) globbing qualifier:

% for f in *(n); do echo "$f"; done
AB_20151127_120000_0_SEGMENT_FINAL.csv
AB_20151127_120000_1_SEGMENT_FINAL.csv
AB_20151127_120000_2_SEGMENT_FINAL.csv
AB_20151127_120000_3_SEGMENT_FINAL.csv
AB_20151127_120000_10_SEGMENT_FINAL.csv
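If Zsh is not available, a similar natural ordering can be approximated in Bash by piping the names through sort -V. This is a sketch that assumes GNU sort, since -V (version sort) is a GNU extension. The variable name natural is only illustrative.

```shell
# The five names from the Bash listing above, in Bash's glob order.
natural=$(printf '%s\n' \
    AB_20151127_120000_0_SEGMENT_FINAL.csv \
    AB_20151127_120000_10_SEGMENT_FINAL.csv \
    AB_20151127_120000_1_SEGMENT_FINAL.csv \
    AB_20151127_120000_2_SEGMENT_FINAL.csv \
    AB_20151127_120000_3_SEGMENT_FINAL.csv |
    sort -V)                    # GNU version sort: digit runs compare by value
printf '%s\n' "$natural"
# AB_20151127_120000_0_SEGMENT_FINAL.csv
# AB_20151127_120000_1_SEGMENT_FINAL.csv
# AB_20151127_120000_2_SEGMENT_FINAL.csv
# AB_20151127_120000_3_SEGMENT_FINAL.csv
# AB_20151127_120000_10_SEGMENT_FINAL.csv
```

Unlike sort -n -t_ -k4, this does not need to know which field holds the number.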
