Working with tab/comma/pipe delimited files (more sed).
#1
So I've been working on a worldwide wordlist of countries, cities, towns and geographical features (lakes, ponds, reserves, parks, hotels, etc.). I wasn't sure if I should start a new thread for this, and please Atom, if you're tired of command-line threads, tell me and I will stop.

Anyway, I'm learning some sed right now and here are some commands I used. I'd like to thank epixoip and M@lik for teaching me some sed fundamentals. sed is really powerful. My first question: is there any way to make these prettier?

I'll dive right into the most complex/ugly command. I had a file delimited by pipes, which looked like this:

Code:
399|Agua Sal Creek|Stream|AZ|04|Apache|001|362740N|1092842W|36.4611122|-109.4784394|362053N|1090915W|36.3480582|-109.1542662|1645|5397|Fire Dance Mesa|02/08/1980|
400|Agua Sal Wash|Valley|AZ|04|Apache|001|363246N|1093103W|36.546112|-109.5176069|362740N|1092842W|36.4611122|-109.4784394|1597|5239|Little Round Rock|02/08/1980|
401|Aguaje Draw|Valley|AZ|04|Apache|001|343417N|1091313W|34.5714281|-109.2203696|344308N|1085826W|34.7188|-108.9739|1750|5741|Kearn Lake|02/08/1980|01/14/2008

So I used:

Code:
cat NationalFile_20120602.txt | sed -n 's/^[^|]*[|]\([^|]*\)[|][^|]*[|][^|]*[|][^|]*[|]\([^|]*\)[|][^|]*[|][^|]*[|][^|]*[|][^|]*[|][^|]*[|][^|]*[|][^|]*[|][^|]*[|][^|]*[|][^|]*[|][^|]*[|]\([^|]*\).*/\1,\2,\3/p' | tr -s ',' '\n' | sort | uniq > NationalFile-Names.txt

Which outputs something like this:

Code:
Agua Sal Creek
Agua Sal Wash
Aguaje Draw
Apache
Fire Dance Mesa
Kearn Lake
Little Round Rock

I know the command is ugly but it worked.
What I was doing:
Match everything before the first pipe, then a pipe, then everything before the next pipe, then a pipe, and so on until I get to the field I want. There I use a \( \) group to capture the field, and then continue the same way. At the end, I call back the 3 backreferences (thank you so much epixoip) that I captured from each line, which are fields 2, 6 and 18. I separate those with commas and then use the tr command to change the commas to line breaks. I then sort and remove duplicates.

Any ideas on how to make this humanly readable?
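
One possibly more readable route, offered as a sketch rather than a tested drop-in replacement: cut can select pipe-delimited fields by number. The sed expression above captures fields 2, 6 and 18, so the same selection is written below with cut -d'|' -f2,6,18. The here-document reuses two of the sample lines from above so the snippet runs on its own; sample and names are just illustrative variable names. One difference to keep in mind: unlike sed -n with p, cut also passes through lines that have fewer fields, which may or may not matter for the real file.

```shell
# Two sample lines copied from the pipe-delimited excerpt above.
sample=$(cat <<'EOF'
399|Agua Sal Creek|Stream|AZ|04|Apache|001|362740N|1092842W|36.4611122|-109.4784394|362053N|1090915W|36.3480582|-109.1542662|1645|5397|Fire Dance Mesa|02/08/1980|
400|Agua Sal Wash|Valley|AZ|04|Apache|001|363246N|1093103W|36.546112|-109.5176069|362740N|1092842W|36.4611122|-109.4784394|1597|5239|Little Round Rock|02/08/1980|
EOF
)

# Keep fields 2, 6 and 18, put each on its own line, then sort and dedupe.
# tr -s squeezes repeated newlines, so empty fields do not leave blank lines.
names=$(printf '%s\n' "$sample" | cut -d'|' -f2,6,18 | tr -s '|' '\n' | sort -u)
printf '%s\n' "$names"
```

Against the real file, the same pipeline would presumably read: cut -d'|' -f2,6,18 NationalFile_20120602.txt | tr -s '|' '\n' | sort -u > NationalFile-Names.txt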

Next, some stuff I learned using CSV files. If you want to replace every comma with a newline (the -s also squeezes runs of newlines into one, so empty fields leave no blank lines):
Code:
tr -s ','  '\n'

To replace one character with another throughout a file, for example underscores with spaces:
Code:
tr -s '_' ' '
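
A quick way to see what the -s flag does in both tr commands, as a minimal sketch on made-up input (the strings and the variable names lines and spaced are only for illustration). Note that -s squeezes every run of the replacement character in the output, so spaces that were already doubled in the input get collapsed too.

```shell
# Commas become newlines; -s squeezes the doubled newline left by the empty field.
lines=$(printf 'Apache,,Kearn Lake\n' | tr -s ',' '\n')
printf '%s\n' "$lines"

# Underscores become spaces; -s also collapses the run of two into one space.
spaced=$(printf 'Dawson__Creek\n' | tr -s '_' ' ')
printf '%s\n' "$spaced"
```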

If you want to work with a delimiter other than a pipe, let's say a comma, use [^,]*[,] in the sed pattern instead and it will work. The same goes for tabs.
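
Here is a small sketch of that substitution on made-up one-line inputs, pulling out the second field. For the tab case a literal tab is stored in a variable (tab is just an illustrative name), on the assumption that \t inside a bracket expression is not portable across sed implementations.

```shell
# Second field of a comma-delimited line (sample data is made up).
csv_field=$(printf 'one,two,three\n' | sed -n 's/^[^,]*,\([^,]*\).*/\1/p')
echo "$csv_field"

# Same idea for tabs, with a literal tab character held in a shell variable.
tab=$(printf '\t')
tsv_field=$(printf 'one\ttwo\tthree\n' | sed -n "s/^[^${tab}]*${tab}\([^${tab}]*\).*/\1/p")
echo "$tsv_field"
```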

Finally, I have a question about my cities, countries, counties, towns and whatever file: what format would you prefer (if interested)? It's quite a big list, 170MB last time I checked, and many entries contain spaces. Take for example 'Dawson Creek'... Would you prefer that I share a list without spaces, or that I leave them in (you can easily strip them yourself)? Would you prefer a no-caps list? I'm asking because that might make the file smaller...

Thank you,
Socapex

Messages In This Thread
Working with tab/comma/pipe delimited files (more sed). - by Socapex - 06-18-2012, 06:45 AM