Posts: 38
Threads: 13
Joined: Jul 2023
Hello
Could you please give me some advice: which text editor or other software could I use to remove duplicate lines across a folder full of .txt files, about 200 GB in total? I mean not the duplicates within each file separately, but the duplicates across all the files in that folder.
I am using Notepad to open those 10+ GB .txt files, but I need to clean them up.
So I am thinking about 2 approaches:
merge all the .txt files into one large file (its size will be > RAM)
find a tool/software to deduplicate a folder of several .txt files.
Posts: 889
Threads: 15
Joined: Sep 2017
(07-13-2023, 02:47 PM)ataman4uk Wrote: […]
Those tools are just for searching duplicate files.
The only thing you can do is write yourself a little script, or use basic Linux programs like cat -> sort -> uniq.
Posts: 38
Threads: 13
Joined: Jul 2023
(07-13-2023, 02:56 PM)Snoopy Wrote: […]
Thank you.
I am using Windows as my OS, and I've used Git Bash, for example, to split those files so each one is 4 GB and it doesn't take an hour to open a 200 GB text file. And now I am confused about how to clean them.
It seems like I should merge them back into one large file.
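A note on the splitting step above: GNU split's plain -b flag can cut a line in half at a chunk boundary, which would corrupt a line-based dedup, while -C caps each chunk at the given byte size without ever splitting a line. A small demo (file names and sizes here are placeholders, not the thread's actual data):

```shell
# Line-safe splitting: -C keeps every line whole, -b does not.
printf 'aaaa\nbbbb\ncccc\n' > big.txt          # stand-in for the real file
split -C 10 --numeric-suffixes --additional-suffix=.txt big.txt part_
# for the real 200 GB data the call would look like:
#   split -C 4G --numeric-suffixes --additional-suffix=.txt big.txt part_
cat part_*.txt                                 # parts concatenate back to the original
```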
Posts: 889
Threads: 15
Joined: Sep 2017
As I mentioned, you can process the files one by one, or merge several at once, with
cat 1.txt 2.txt ... | sort | uniq >> uniq.uniq
or with a for loop:
for file in *.txt; do cat "$file" | sort | uniq >> uniq.uniq; done
You can do this file by file with the for loop shown, but either way you will need one last pass over the resulting uniq.uniq: this is quick and dirty, and uniq.uniq can (and likely will) still contain duplicates, because each input file is only deduplicated against itself, not against the other files. (In exchange, it's just a one-liner in Bash.)
Depending on the input, uniq.uniq can still get very large. You may need to split the work into batches of 10-20 text files: build a uniq.uniq for each batch, run cat | sort | uniq on that result file, store it, and move on to the next batch. At the end, merge the per-batch files one last time and run cat | sort | uniq again.
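The batch scheme described above can be sketched as a small Bash function (the function name and the batch size of 10 are arbitrary choices; GNU sort's -m flag merges already-sorted files, and -u drops duplicates during the merge, so the final pass never re-sorts everything):

```shell
# Deduplicate lines across many files in batches: `sort -u` each batch of
# up to 10 files into a temp file, then merge the pre-sorted batch files
# with `sort -m -u`, which removes cross-batch duplicates cheaply.
dedup_batches() {
  local out=$1; shift                  # output file, then the input files
  local tmp; tmp=$(mktemp -d)
  local i=0 batch=0 files=()
  for f in "$@"; do
    files+=("$f"); i=$((i+1))
    if [ "$i" -eq 10 ]; then
      sort -u "${files[@]}" > "$tmp/batch$batch"
      batch=$((batch+1)); i=0; files=()
    fi
  done
  [ "$i" -gt 0 ] && sort -u "${files[@]}" > "$tmp/batch$batch"
  sort -m -u "$tmp"/batch* > "$out"    # final merge, still deduplicating
  rm -rf "$tmp"
}
```

Called as `dedup_batches uniq.uniq *.txt`. For inputs larger than RAM, GNU sort automatically spills to temporary files on disk, so pointing TMPDIR at a drive with enough free space may be necessary.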
Posts: 38
Threads: 13
Joined: Jul 2023
07-13-2023, 05:56 PM
(This post was last modified: 07-13-2023, 05:58 PM by blaster666.)
(07-13-2023, 04:05 PM)Snoopy Wrote: […]
Thank you so much. cat is working for me. I am checking for duplicates in the first halves of 2 different files, 16 GB in total.
My RAM is 64 GB.
The PC is working hard (I can hear the fan), the Git Bash window is responding, and everything seems OK,
but the uniq.uniq file is still 0 KB after 10 minutes.
My question is: is there any way to add a loading/percentage bar to track the progress?
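On the 0 KB output: that is expected, because sort cannot emit a single line until it has read all of its input, so the output file stays empty until the very end. For a progress indicator, one option is to feed the input through pv, which prints a byte-based percentage bar on stderr. An assumption here: pv is not bundled with Git Bash and would need a separate install (e.g. via MSYS2's pacman), so this sketch falls back to plain cat when it is absent:

```shell
# Wrap the dedup pipeline so it shows progress when pv is available.
# pv is given the files directly, so it knows the total size and can
# render a percentage bar; without pv the pipeline runs silently.
dedup_with_progress() {
  local out=$1; shift                  # output file, then the input files
  if command -v pv >/dev/null 2>&1; then
    pv "$@" | sort | uniq > "$out"     # progress bar appears on stderr
  else
    cat "$@" | sort | uniq > "$out"    # silent fallback
  fi
}
```

Note the single `>` rather than `>>`, so rerunning the command does not append a second copy to an existing output file.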
Posts: 25
Threads: 2
Joined: Apr 2023
Two other options for you: install Notepad++, press Ctrl + A, then
Edit > Line Operations > Remove Duplicate Lines
OR
https://jpm22.github.io/txt/
go to the remove-duplicates option and select Big File.
The second method may be better, as it offers a case-sensitive option.
Hope this helps!
Posts: 38
Threads: 13
Joined: Jul 2023
07-13-2023, 11:15 PM
(This post was last modified: 07-13-2023, 11:21 PM by blaster666.)
(07-13-2023, 04:05 PM)Snoopy Wrote: […]
Could you tell me, please, how this could have happened?
I used cat to sort-and-unique 2 txt files,
one with 92 million lines, the other with 42 million lines,
and it gave me a 350-million-line file as output.
Oh, now I see how that happened.
It just saved the output to the file twice.
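For anyone else hitting this: `>>` appends to the output file on every run, so rerunning the same pipeline stacks a second copy on top of the first, while `>` truncates the file before writing. A minimal demonstration:

```shell
rm -f out.txt                      # start from a clean slate
printf 'b\na\n' > demo.txt
sort demo.txt | uniq >> out.txt    # first run: out.txt has 2 lines
sort demo.txt | uniq >> out.txt    # rerun appends: out.txt now has 4 lines
sort demo.txt | uniq >  out.txt    # '>' truncates first: back to 2 lines
wc -l < out.txt
```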
Posts: 119
Threads: 1
Joined: Apr 2022
(07-13-2023, 02:47 PM)ataman4uk Wrote: […]
Have a look at https://github.com/Cynosureprime/rling