text editor to delete duplicates from pool of large files
#7
(07-13-2023, 04:05 PM)Snoopy Wrote: as I mentioned, you can process the files one by one, or merge several of them at once with

cat 1.txt 2.txt ... | sort | uniq >> uniq.uniq
or with a for loop
for file in *.txt; do cat "$file" | sort | uniq >> uniq.uniq; done

you can do this file by file with the for loop shown above, but either way you will need one last pass over the resulting uniq.uniq: this is quick and dirty, and uniq.uniq can (and usually will) still contain duplicates, because the different input files are never compared with each other, only each file against itself (but this way it stays a bash one-liner)
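
For that final pass, a one-liner like this should do (a minimal sketch, assuming the merged file is called uniq.uniq as above and the cleaned result goes to a placeholder final.uniq):

# collapse the duplicates that remain because the per-file runs never compared files with each other
sort uniq.uniq | uniq > final.uniq
# equivalent, in a single step
sort -u uniq.uniq > final.uniq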

depending on the input, this uniq.uniq can still get very large. If so, split the work: take 10-20 text files, build a uniq.uniq for that batch, run cat | sort | uniq on the result, store it, and move on to the next batch. At the end, merge the per-batch uniq files one last time and run cat | sort | uniq again.
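
A minimal sketch of that batching idea, assuming batches of 20 files and plain sort; the batch_N.uniq and final.uniq names are just placeholders:

# dedupe each input file and append it to the current batch file;
# every 20 files, dedupe the batch file itself and start a new one
i=0
batch=0
for file in *.txt; do
    sort -u "$file" >> "batch_$batch.uniq"
    i=$((i+1))
    if [ "$i" -eq 20 ]; then
        sort -u "batch_$batch.uniq" -o "batch_$batch.uniq"
        batch=$((batch+1))
        i=0
    fi
done
# merge all batch results and run one last dedupe over them
sort -u batch_*.uniq > final.uniq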

Could you tell me, please, how could this happen?
I used cat to sort-unique two txt files,
one with 92 million strings, the other with 42 million strings,
and it gave me a file with 350 million strings as output.


Oh, now I see how that happened.
It just saved the output file twice.
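
If the one-liner with >> was run more than once, that would explain it: append mode keeps the old contents of uniq.uniq and stacks another copy on top, so the output can end up larger than both inputs combined. A quick sanity check, plus a variant that avoids the problem (assuming the two inputs are 1.txt and 2.txt):

# line counts of the inputs and the output; the output should never exceed the sum of the inputs
wc -l 1.txt 2.txt uniq.uniq
# > truncates the old output instead of appending, and -u dedupes in one pass
sort -u 1.txt 2.txt > uniq.uniq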