foreign words and --remove
#1
I've got two questions I can't find the answer to:

1) I do my hashcat runs with --remove, thinking it might be faster since hashcat doesn't have to check a particular hash again. But now I'm wondering whether the disk I/O from having to update the hash file more than offsets any checking time saved. I can't be the only one to think of this, but I can't find anything either way. Speed-wise, is it better to use --remove when hunting for passwords, to apply the found-password list from the last run before starting the next, or to not use it at all? I have a backup of my original list, so I'm not worried about screwing that up.

2) What character set does hashcat assume/use? For example, the Spanish alphabet (and keyboard) has the ñ character. Could I use a wordlist with that character without hex-encoding it? What about in a mask?
#2
For question 1, it depends on the type of hashes and also on your hardware (SSD, RAM, CPU, etc.). In my experience, you shouldn't bother too much about removing hashes unless you crack (and therefore need to remove) more than about 15% of them. If hashcat only removes a few hundred hashes from a multi-million hash list, the removal changes very little: the gain is negligible, and the time spent removing that <1% of the hashes may well be higher than the time needed to load them.
--remove mainly affects the initialization/startup phase, when the hashes are being loaded. Hashcat will in any case skip (and therefore ignore) all hashes that are already present in the potfile (unless you disable potfile support with --potfile-disable). The cracking speed will be almost exactly the same as if you had not loaded those hashes at all (i.e. as if they had already been removed externally, or with --remove on the previous run). So the main impact is on the first few seconds, when the hashes need to be parsed and checked; as mentioned above, that is negligible if you only remove very few hashes from a multi-million hash list.
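
If you do want to keep using --remove, one common pattern (sketched here with placeholder file names and MD5, -m 0) is to only ever let it touch a working copy; the --left run is an alternative that prunes against the potfile instead of rewriting the hash file mid-run:
Code:
# keep a pristine copy; let --remove prune only the working copy
cp hashes.txt hashes.work.txt
hashcat -m 0 -a 0 --remove hashes.work.txt wordlist.txt

# alternative: no --remove at all, just export what the potfile says is still uncracked
hashcat -m 0 --left hashes.txt -o still_uncracked.txt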

As for question 2, it depends on the encoding. hashcat (and, in fact, the hashing algorithms themselves) works on a byte-by-byte level. That means the character ñ can be represented by different byte sequences depending on the character encoding. If it is encoded in utf8, it uses at least 2 bytes:
Code:
echo -n ñ | xxd -p
c3b1
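
If you can't type ñ directly, you can still get those exact utf8 bytes into a wordlist, e.g. with bash's printf (the word "peña" here is just a made-up example):
Code:
# append the utf8 bytes of "peña" to a wordlist without typing the character itself
printf 'pe\xc3\xb1a\n' >> wordlist.txt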

Masks also work on a byte-by-byte level. Therefore this character (assuming utf8 was the encoding in which the input was fed to the original hash generation) needs a mask of length 2 just to cover "one character": for instance ?b?b (though of course you can use a much more specific mask of length 2).
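
For example, to pin those two bytes exactly instead of using ?b?b (the hash mode and file names here are just placeholders):
Code:
# --hex-charset makes -1/-2 hex; ?1?2 then matches exactly 0xc3 0xb1 (utf8 ñ)
hashcat -m 0 -a 3 --hex-charset -1 c3 -2 b1 hashes.txt ?1?2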

On the other hand, a different encoding may have been used (note: you need to generate the byte yourself, since the forum software will always convert it to utf8), for instance ISO-8859-1:
Code:
echo -en "\xf1"
ñ
only a mask of length 1 is needed, since 0xf1 is just one byte long (compared to 0xc3b1, which is 2 bytes in utf8).
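
The single-byte equivalent would look something like this (again, mode and file names are placeholders):
Code:
# one byte, one mask position: 0xf1 is ñ in ISO-8859-1
hashcat -m 0 -a 3 --hex-charset -1 f1 hashes.txt ?1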

Yeah, character encoding is tricky (even for some experienced programmers/"experts"). The good thing is that a lot of hashes out there just use utf8, so you usually don't need to waste much time finding the correct character encoding... but the catch is that utf8/utf16/utf32 etc. can use multi-byte characters, and the masks need to reflect this by using the correct length (e.g. 2 vs 1 in the example above).
#3
I started typing this reply last night but forgot to post it. I know philsmd has already replied, but since I had all this typed already, it seemed like a waste not to go ahead and post it.

Your assumptions regarding I/O are correct: --remove will kill performance if you have a very high hit rate and/or a very large hash list. Keep in mind that ever since hashcat got potfile support a few years back, it won't attempt to re-crack a hash you've previously cracked, provided the hash is found in the potfile. As a general rule of thumb, options like --remove and -o really only make sense when used in conjunction with --potfile-disable.
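
A sketch of that combination (hash mode and file names are placeholders):
Code:
# --remove and -o paired with --potfile-disable, so the hash file and outfile are the only state
hashcat -m 0 -a 0 --potfile-disable --remove -o found.txt hashes.txt wordlist.txt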

Hashcat makes no assumptions about encoding; it only hashes bytes. If the correct bytes are in the wordlist, hashcat will find the password. Masks are a bit of a different story, however, in that the built-in charsets only cover ASCII and hex. If you want a different charset, you'll need to either use one of the supplied charset files, or use --hex-charset and roll your own.
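
"Rolling your own" Spanish charset with --hex-charset might look something like this; the hash mode, file names and password length are placeholders, and the hex string is simply the ISO-8859-1 bytes for á é í ó ú ñ:
Code:
# custom charset -1 given in hex: e1 e9 ed f3 fa f1 = á é í ó ú ñ (ISO-8859-1)
hashcat -m 0 -a 3 --hex-charset -1 e1e9edf3faf1 hashes.txt ?l?l?l?l?l?1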
#4
Note that the latest hashcat versions have this:

Quote:
--encoding-from | Code | Force internal wordlist encoding from X | --encoding-from=iso-8859-15
--encoding-to   | Code | Force internal wordlist encoding to X   | --encoding-to=utf-32le

But this is for wordlist processing only!
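
A hedged example of what that wordlist conversion might look like in practice (hash mode and file names are placeholders), re-encoding an ISO-8859-15 wordlist into utf8 candidates on the fly:
Code:
# candidates are read as iso-8859-15 and converted to utf8 before hashing
hashcat -m 0 -a 0 --encoding-from=iso-8859-15 --encoding-to=utf-8 hashes.txt latin_wordlist.txt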