fgets-sse2 v0.01 released
Hey Guys,

Fast loading of a dictionary is an important part of a password cracker, and fgets(3) is not well suited to high-performance text file parsing.

Here is my SSE2-accelerated version of fgets(3), which is ~4 times faster than glibc's version. It should be possible to replace glibc's fgets in your project with it.

Password cracking programs in particular can benefit from it, but I am sure other kinds of projects can benefit as well.

Feel free to integrate it into your project. I've released it as open-source under GPLv2.


See README for details.

very nice!
wow, very cool. So you even put in some time to solve the last few problems :)
Text parsing... Like a boaws.
Sorry for reviving this thread.

Out of my own curiosity, I translated the len utility to use fgets_sse2 instead of fgets and am seeing some odd behavior. Output:

mangix@Mangix ~/testing
$ len 2 6 < enwik8 | wc -l

mangix@Mangix ~/testing
$ ./len 2 6 < enwik8 | wc -l

Oddly enough, when I use rockyou.txt everything is fine:

mangix@Mangix ~/testing
$ len 2 6 < rockyou.txt | wc -l

mangix@Mangix ~/testing
$ ./len.exe 2 6 < rockyou.txt | wc -l

When I piped the outputs to two different files and compared them, it seems that the version using fgets_sse2 is not keeping lines which have a . at the end. Maybe it's an issue with newline characters being handled differently. enwik8 is available here: http://mattmahoney.net/dc/enwik8.zip

edit: I found a secondary problem. It looks like when you feed it wordlists that have \r\n at the end of a line, the \r gets treated as part of the word. Looks like filtering is needed.
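
One way to do the filtering mentioned above is to strip the line ending after each line is read. This is only a minimal sketch: strip_eol is a hypothetical helper name, not part of fgets-sse2, and it assumes the line is a NUL-terminated string as returned by an fgets-style function.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of fgets-sse2): remove one trailing '\n'
   and then one trailing '\r' in place, so "word\r\n", "word\n" and "word"
   all end up as "word". Returns the new length. */
static size_t strip_eol(char *line)
{
    size_t len = strlen(line);

    if (len > 0 && line[len - 1] == '\n')
        line[--len] = '\0';
    if (len > 0 && line[len - 1] == '\r')
        line[--len] = '\0';

    return len;
}
```

Calling it on every line before the length check would keep a stray \r from being counted as part of the word.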
Hi guys, is this still alive?

I stumbled upon this one and gave it a try with VS on Windows. It seems to work fine for me, but I don't get anywhere near a 4x speedup: *only* about double compared to fgets from the VS runtime, and about 30% compared to doing the same thing over a buffer read with fread without intrinsics. I guess this SSE2-optimized fgets is similar in concept to using fread, with the extra performance coming from parallel processing with SSE2.

With fread:

duration 0: 0.957747
duration 1: 0.869212
duration 2: 0.842793
Mean value: 0.889917

With fgets:

duration 0: 1.258209
duration 1: 1.258632
duration 2: 1.258991
Mean value: 1.258611

With SSE2-optimized fgets:

duration 0: 0.661175
duration 1: 0.661901
duration 2: 0.661526
Mean value: 0.661534

I used 100 MB of randomly generated ASCII with a relatively uniform distribution of newlines. I had to make some changes to the code to get it to compile with VS (just in the struct declarations). My test code can be seen here: http://www.nextpoint.se/?p=580 and if someone would like the modified code I can make it available.
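
For readers who want a picture of the plain fread variant from the comparison above, here is a rough sketch. It is not the linked test code: count_lines and CHUNK_SIZE are made-up names, the buffer size is an assumption, and it only counts lines instead of handing them to a caller.

```c
#include <stdio.h>
#include <string.h>

#define CHUNK_SIZE 65536   /* assumed buffer size, tune as needed */

/* Hypothetical sketch: count lines by reading large chunks with fread and
   scanning each chunk with memchr. A final line that has no trailing '\n'
   is still counted. */
static size_t count_lines(FILE *stream)
{
    static char buf[CHUNK_SIZE];
    size_t lines = 0;
    size_t n;
    int last = '\n';   /* last byte read; '\n' means no partial line pending */

    while ((n = fread(buf, 1, sizeof buf, stream)) > 0) {
        const char *p = buf;
        const char *end = buf + n;

        /* Jump from newline to newline inside the current chunk. */
        while ((p = memchr(p, '\n', (size_t)(end - p))) != NULL) {
            lines++;
            p++;
        }
        last = (unsigned char)buf[n - 1];
    }

    if (last != '\n')
        lines++;       /* unterminated last line */

    return lines;
}
```

The point is that the stream is touched once per chunk rather than once per line, and that happens before any intrinsics come into play.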
If the file does not end in a newline, then the last line is discarded.

I fixed the issue with the line marked in the code below:
for (i = 0; i < BUFFER_EXTRA; i++, num++)
if (ctx->buf[num - 1] == '\n'){
const int c = fgetc (stream);

if (c == EOF) {
ctx->buf[num] = '\0'; /* <--- this seems to fix it. */
ctx->buf[num] = c;

After looking at the code, I think that most of the speedup actually comes from using fread rather than assembly, though the assembly adds some speedup.
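
To show roughly what the SSE2 part adds on top of a big fread buffer, here is a sketch of a newline search that tests 16 bytes per step. It is written with compiler intrinsics rather than assembly and is not taken from fgets-sse2. find_newline is a hypothetical name, __builtin_ctz is a GCC/Clang builtin, and on other compilers or targets the code falls back to the byte loop.

```c
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical sketch: return the index of the first '\n' in buf[0..len),
   or len if there is none. */
static size_t find_newline(const char *buf, size_t len)
{
    size_t i = 0;

#if defined(__SSE2__) && (defined(__GNUC__) || defined(__clang__))
    const __m128i nl = _mm_set1_epi8('\n');

    for (; i + 16 <= len; i += 16) {
        /* Unaligned 16-byte load, then compare every byte against '\n'. */
        const __m128i chunk = _mm_loadu_si128((const __m128i *)(buf + i));
        const int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(chunk, nl));

        /* Bit k of mask is set if byte k of the chunk is a newline. */
        if (mask != 0)
            return i + (size_t)__builtin_ctz((unsigned)mask);
    }
#endif

    /* Scalar tail, and the whole search when SSE2 is not available. */
    for (; i < len; i++)
        if (buf[i] == '\n')
            return i;

    return len;
}
```

A good memchr usually does something similar internally, which would fit with the gain over plain fread being fairly modest.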

Btw, if anyone is going to use that code, all those malloced/calloced buffers need to be freed as well.