cat "multibyte or wide character" error
#1
Hi all. I'm trying to compare and uniq two .txt files, and I'm getting this error:
sort: string comparison failed: Invalid or incomplete multibyte or wide character
sort: Set LC_ALL='C' to work around the problem.
sort: The strings compared were ‘sorient\342t\r’ and ‘sorient\350rent\r’.

I would be very grateful if you could advise me how to avoid it.
Deleting the strings manually is not an option, as the .txt file is really large.

I have found that it is caused by lines containing unprintable characters. How can I remove all of them from the txt file?
That is, how can I remove every line that contains unprintable characters, or something similar?


#2
(07-13-2023, 11:05 PM)ataman4uk Wrote: Hi all. I'm trying to compare and uniq two .txt files, and I'm getting this error:
sort: string comparison failed: Invalid or incomplete multibyte or wide character
sort: Set LC_ALL='C' to work around the problem.
sort: The strings compared were ‘sorient\342t\r’ and ‘sorient\350rent\r’.

I would be very grateful if you could advise me how to avoid it.
Deleting the strings manually is not an option, as the .txt file is really large.

I have found that it is caused by lines containing unprintable characters. How can I remove all of them from the txt file?
That is, how can I remove every line that contains unprintable characters, or something similar?

From the CPU point of view, this could probably make sense. Apart from that, I didn't write the backup "script" myself, and I believe it was originally meant for backing up to a networked drive. In any case, I believe rsync is still a good option because of its versatility (local use is documented in the man page) and its performance with big folders.

cp would mean juggling find and timestamp files, and tar would create archives, which doesn't help when you need to get a file back quickly.

That said, does anyone feel like discussing this "multibyte character" problem?
#3
First, don't open new threads for questions that come up after receiving an answer in another thread; use the old one.

Second, one possible answer is already given by sort:

Code:
Set LC_ALL='C' to work around the problem.
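
In practice that hint means setting the variable just for the sort command. A minimal sketch, assuming two input files (the names file1.txt, file2.txt and merged.txt are made up here, and the sample data only imitates the strings from the error message):

```shell
# Hypothetical sample data: the same raw bytes (\342, \350) and CRLF line
# endings that appear in the error message, written to a scratch directory.
workdir=$(mktemp -d)
printf 'sorient\342t\r\nabc\r\n' > "$workdir/file1.txt"
printf 'sorient\350rent\r\nabc\r\n' > "$workdir/file2.txt"

# LC_ALL=C applies to this one command only. sort then compares raw bytes
# instead of locale-aware characters, so invalid multibyte sequences no
# longer make the comparison fail. -u keeps one copy of each distinct line.
LC_ALL=C sort -u "$workdir/file1.txt" "$workdir/file2.txt" > "$workdir/merged.txt"
```

The resulting order is plain byte order, not dictionary order, which is usually fine when the goal is only to deduplicate.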

Nevertheless, there is another great Linux tool called iconv:

Code:
iconv -c input.txt > output.txt
or
iconv -c -t UTF-8 input.txt > output.txt

With -c, iconv omits characters that cannot be converted (such as invalid byte sequences) from the output; it drops the offending bytes, not the whole line. Nevertheless, it seems your input files are malformed or have been through some serious misconversion between character encodings, which is what usually causes the problems you mentioned.
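
Two possible follow-ups, as a rough sketch only. The bytes \342 and \350 in the error message are â and è in ISO-8859-1, so one guess (an assumption, not something confirmed in this thread) is that the files are Latin-1 rather than UTF-8; in that case converting keeps the words intact instead of stripping them. If you really want to throw away every line that contains anything other than printable ASCII, grep in the C locale can do that. File names below are made up, and -a is a GNU/BSD grep option:

```shell
# Hypothetical sample input: one Latin-1 line and one plain ASCII line,
# both with CRLF line endings as in the error message.
workdir=$(mktemp -d)
printf 'sorient\342t\r\nplain\r\n' > "$workdir/input.txt"

# Option 1 (assumes the source really is ISO-8859-1): convert to UTF-8 and
# drop the carriage returns, so a UTF-8 locale can sort the result.
iconv -f ISO-8859-1 -t UTF-8 "$workdir/input.txt" | tr -d '\r' > "$workdir/converted.txt"

# Option 2: remove whole lines that contain any byte outside printable
# ASCII and whitespace. In the C locale every byte >= 0x80 counts as
# non-printable, so this also discards legitimately accented words.
LC_ALL=C grep -av '[^[:print:][:space:]]' "$workdir/input.txt" > "$workdir/ascii_only.txt"
```

Option 1 loses nothing if the guess about the encoding is right; option 2 is closer to what was asked for, but it is lossy.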