Question about utf8 files
Hello there
Does anyone know of a simple way of finding out whether a particular field in a file has been left untranslated, just by looking at the characters? I mean, say a field needs to have Chinese, Arabic or Russian words only; I'd like to check whether this is the case, or whether it still has only (or any) Western characters.
I can't think of a way of doing this automatically (and sometimes I need to check files that are thousands of lines long).
Cheers.
P.
Re: Question about utf8 files
I think an easy way to do this could be to check for any character with a code point of 255 or less (that is, ASCII plus Latin-1). If you want to rule out other Latin-script languages as well, you might need to raise the bar a little.
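A minimal sketch of the threshold idea, restricted to the plain ASCII range (code points up to 127), which is the easy part to test from the shell: in UTF-8 those characters are single bytes that never occur inside a multi-byte character. The function name find_ascii_letters is made up for illustration, and it only looks for the letters A-Z and a-z, not digits or punctuation.

```shell
# Hypothetical helper (name is an assumption, not from the thread):
# print "line_number:line" for every input line that contains at least
# one ASCII letter. LC_ALL=C makes grep compare raw bytes, so the range
# A-Za-z means exactly the 52 ASCII letters regardless of the user's locale.
find_ascii_letters() {
    LC_ALL=C grep -n '[A-Za-z]' "$@"
}
```

Running `find_ascii_letters file.xml` prints each offending line prefixed with its line number. Catching accented Latin letters as well would need a wider test than this one.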
Hugs,
James
Re: Question about utf8 files
Hi James
Thanks for the tip. In the end I settled for this simple solution:
Code:
sed 's/<[^>]*>//g' *.xml |grep -o "[a-zA-Z]*"
That is, I first stripped off the XML tags and then grepped for any sequence of letters in the ASCII range. It produces a list of the words that are not in Chinese or Arabic. It wouldn't work well for Western languages, but it did the job for what I needed. I'm sure it can be improved. Leaving out the "-o" flag makes it print every line in full, because "[a-zA-Z]*" also matches the empty string, so every line counts as a match.
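One possible refinement, as a sketch only: the same pipeline wrapped in a function that lists each leftover ASCII word once. The name ascii_words is hypothetical, and the sketch assumes every tag opens and closes on the same line, since sed works line by line.

```shell
# Hypothetical helper (name is an assumption, not from the thread):
# list each run of ASCII letters found outside XML tags, once, sorted.
ascii_words() {
    # 1. Strip anything that looks like <...> on a single line.
    # 2. Print only the matched runs (-o); the pattern requires at least
    #    one letter, so it never matches the empty string.
    # 3. Sort and drop duplicates.
    LC_ALL=C sed 's/<[^>]*>//g' "$@" |
        LC_ALL=C grep -o '[A-Za-z][A-Za-z]*' |
        LC_ALL=C sort -u
}
```

Called as `ascii_words *.xml`, it prints one word per line. An empty result means no ASCII letters were found outside the tags.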
Cheers.
P.
Re: Question about utf8 files