Thank you for your reply, rurasort looks interesting.
I'm still thinking about how I would make this without it being painfully slow. It might be really complicated and not worth it.
EDIT:
Ouch, it's only reading at 2-3 MB/s while hitting 100% on a single core. I'll see if I can improve on this; if I do, I'll post it here.
EDIT2:
I'm really bad at keeping up with forum posts, so I'm just going to post the rough piece of code.
rurasort.py --digit-trim --special-trim --lower: 34.844s for 10M lines
the code below, doing the same thing: 2.503s for 10M lines
Code:
import multiprocessing as mp
import os

path = "/mnt/NVMe/wordlist_10M.txt"
cores = 8

DIGITS = "0123456789"
SPECIALS = "!\"#$%&'()*+,-./:;?@[\\]^_`{|}~"

def process(line):
    # emulate rurasort --digit-trim --special-trim --lower:
    # trim digits, then special characters, from both ends, then lowercase
    newstring = line.strip(DIGITS)
    newstring = newstring.strip(SPECIALS)
    print(newstring.lower())

def process_wrapper(chunkStart, chunkSize):
    # each worker re-opens the file and handles only its own byte range
    with open(path, "rb") as f:
        f.seek(chunkStart)
        lines = f.read(chunkSize).splitlines()
        for line in lines:
            process(line.decode("utf-8", errors="ignore"))

def chunkify(fname, size=1024 * 1024):
    # yield (start, length) byte ranges of roughly 1 MiB, each extended to the
    # next newline so no line is split across two workers
    fileEnd = os.path.getsize(fname)
    with open(fname, "rb") as f:
        chunkEnd = f.tell()
        while True:
            chunkStart = chunkEnd
            f.seek(size, 1)   # jump ahead by the chunk size
            f.readline()      # then finish the current line
            chunkEnd = f.tell()
            yield chunkStart, chunkEnd - chunkStart
            if chunkEnd > fileEnd:
                break

if __name__ == "__main__":
    # init objects
    pool = mp.Pool(cores)
    jobs = []
    # create one job per chunk
    for chunkStart, chunkSize in chunkify(path):
        jobs.append(pool.apply_async(process_wrapper, (chunkStart, chunkSize)))
    # wait for all jobs to finish
    for job in jobs:
        job.get()
    # clean up
    pool.close()
    pool.join()
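One thing to watch: every worker prints straight to stdout, so if you redirect that to a file the chunks land in whatever order they finish. If you'd rather end up with a single cleaned file, here's a rough sketch of the same idea where each worker returns its cleaned chunk and the parent writes everything out in order. It reuses path, cores, DIGITS, SPECIALS and chunkify() from the script above; clean_chunk and out_path are just names I made up for the example, and I haven't benchmarked this variant.
Code:
def clean_chunk(chunkStart, chunkSize):
    # same trimming as process(), but collected and returned instead of printed
    cleaned = []
    with open(path, "rb") as f:
        f.seek(chunkStart)
        for raw in f.read(chunkSize).splitlines():
            line = raw.decode("utf-8", errors="ignore")
            cleaned.append(line.strip(DIGITS).strip(SPECIALS).lower())
    return cleaned

if __name__ == "__main__":
    out_path = "/mnt/NVMe/wordlist_10M_clean.txt"   # made-up output path
    pool = mp.Pool(cores)
    jobs = [pool.apply_async(clean_chunk, c) for c in chunkify(path)]
    with open(out_path, "w") as out:
        # job.get() returns results in submission order, so chunk order is kept
        for job in jobs:
            out.write("\n".join(job.get()) + "\n")
    pool.close()
    pool.join()
Holding one chunk's worth of cleaned lines in memory at a time is cheap, so this shouldn't slow things down much, but like I said I haven't timed it.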