Working with tab/comma/pipe-delimited files (more sed).
#1
So I've been working on a worldwide wordlist of countries, cities, towns, and geographical features (lakes, ponds, reserves, parks, hotels, etc.). Now I wasn't sure if I should start this new thread, and please, Atom, if you're tired of command-line threads, tell me and I will stop.

Anyways, I'm learning some sed right now and here are some commands I used. I'd like to thank epixoip and M@lik for teaching me some sed fundamentals; sed is really powerful. My first question: is there any way to make these prettier?

I'll dive right into the most complex/ugly command. I had a file delimited by pipes, which looked like this:

Code:
399|Agua Sal Creek|Stream|AZ|04|Apache|001|362740N|1092842W|36.4611122|-109.4784394|362053N|1090915W|36.3480582|-109.1542662|1645|5397|Fire Dance Mesa|02/08/1980|
400|Agua Sal Wash|Valley|AZ|04|Apache|001|363246N|1093103W|36.546112|-109.5176069|362740N|1092842W|36.4611122|-109.4784394|1597|5239|Little Round Rock|02/08/1980|
401|Aguaje Draw|Valley|AZ|04|Apache|001|343417N|1091313W|34.5714281|-109.2203696|344308N|1085826W|34.7188|-108.9739|1750|5741|Kearn Lake|02/08/1980|01/14/2008

So I used:

Code:
cat NationalFile_20120602.txt | sed -n 's/^[^|]*[|]\([^|]*\)[|][^|]*[|][^|]*[|][^|]*[|]\([^|]*\)[|][^|]*[|][^|]*[|][^|]*[|][^|]*[|][^|]*[|][^|]*[|][^|]*[|][^|]*[|][^|]*[|][^|]*[|][^|]*[|]\([^|]*\).*/\1,\2,\3/p' | tr -s ',' '\n' | sort | uniq > NationalFile-Names.txt

Which outputs something like this:

Code:
Agua Sal Creek
Agua Sal Wash
Aguaje Draw
Apache
Fire Dance Mesa
Kearn Lake
Little Round Rock

I know the command is ugly, but it worked.
What I was doing:
Match everything before the first pipe, then a pipe, then everything before the next pipe, then a pipe, and so on, until I got to the fields I wanted. There I used a \( \) group to isolate the field and continued. At the end, I call back my 3 backreferences (thank you so much epixoip) that I had extracted from each line. I separate those with commas and then use the tr command to change the commas into linebreaks. I then sort and remove duplicates.

Any ideas on how to make this human-readable?
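
Edit: one shorter version I stumbled on while writing this up. Since the fields are at fixed positions, cut can pull them out directly; a minimal sketch, assuming GNU coreutils and the same NationalFile_20120602.txt (cut keeps the | between the selected fields, so tr still splits them onto separate lines, and -s skips any empty fields like in my original pipeline):

Code:
cut -d'|' -f2,6,18 NationalFile_20120602.txt | tr -s '|' '\n' | sort -u > NationalFile-Names.txt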

Finally, some stuff I learned using CSV files. If you want to replace all commas with an EOL (newline):
Code:
tr -s ','  '\n'
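
Note that the -s (squeeze) flag collapses runs of the replacement character, so two commas in a row (an empty CSV field) won't leave a blank line behind. A quick illustration with a made-up input:

Code:
printf 'a,,b\n' | tr -s ',' '\n'   # outputs two lines: a, b (empty field swallowed)
printf 'a,,b\n' | tr ',' '\n'      # outputs three lines: a, blank, b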

Replace one character with another throughout a file, here underscores with spaces:
Code:
tr -s '_' ' '

If you want to work with something other than a pipe, let's say a comma, use [^,]*[,] and it will work. Same for tabs.
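
For example, pulling the second field out of a comma-delimited file would look something like this (a sketch; file.csv is a placeholder name):

Code:
sed -n 's/^[^,]*[,]\([^,]*\).*/\1/p' file.csv

For tabs, GNU sed understands \t in place of the commas; with other seds you may need to type a literal tab character (Ctrl-V then Tab).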

Finally, I have a question. I'm working on my cities, countries, counties, towns and whatever file, and I'd like to ask what you'd prefer (if you're interested). It's quite a big list, 170MB last time I checked, and there are many entries with spaces in them, take for example 'Dawson Creek'... Would you prefer that I share a list without spaces, or that I leave them in there (you can easily do the work yourself)? Would you prefer a no-caps list? I'm asking because that might make the file smaller...
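
For anyone who'd rather do those conversions themselves, these are roughly the one-liners I'd use (wordlist.txt is a placeholder name):

Code:
tr -d ' ' < wordlist.txt > wordlist-nospaces.txt     # strip all spaces
tr 'A-Z' 'a-z' < wordlist.txt > wordlist-lower.txt   # lowercase everything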

Thank you,
Socapex
#2
Since this is not about hash-cracking, I believe this is not the right place to discuss it.

However, I don't mind helping you:
awk is the best tool for this job; on Windows, use gawk:
Code:
type 1.txt
399|Agua Sal Creek|Stream|AZ|04|Apache|001|362740N|1092842W|36.4611122|-109.4784394|362053N|1090915W|36.3480582|-109.1542662|1645|5397|Fire Dance Mesa|02/08/1980|
400|Agua Sal Wash|Valley|AZ|04|Apache|001|363246N|1093103W|36.546112|-109.5176069|362740N|1092842W|36.4611122|-109.4784394|1597|5239|Little Round Rock|02/08/1980|
401|Aguaje Draw|Valley|AZ|04|Apache|001|343417N|1091313W|34.5714281|-109.2203696|344308N|1085826W|34.7188|-108.9739|1750|5741|Kearn Lake|02/08/1980|01/14/2008

gawk -F"|" "{print $2 \"\n\" $6 \"\n\" $18}" 1.txt | sort -u
Agua Sal Creek
Agua Sal Wash
Aguaje Draw
Apache
Fire Dance Mesa
Kearn Lake
Little Round Rock
Simple enough?
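
On Linux the equivalent with single quotes should be (an untested sketch; plain awk works the same as gawk here):

Code:
awk -F'|' '{print $2 "\n" $6 "\n" $18}' 1.txt | sort -u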
#3
(06-18-2012, 07:41 PM)M@LIK Wrote: Since this is not about hash-cracking, I believe this is not the right place to discuss it.

This is what the new section is for!! :D

Atom has already said that it's OK to talk about things in this section as long as they are loosely related to hash-cracking.

The normal forum rules still apply, however: no posting hashes, no warez, and no advertising. This conversation is very interesting, please carry on! :)
#4
I really appreciate this type of discussion as I'm new to awk and sed as well.
#5
Thanks for the help M@lik, yet another tool to learn :) awk is definitely better suited for this.
#6
Thanks a lot to all, much appreciated; I'm following this thread closely too!