Question about utf8 files


07-27-2011
pabloa

Question about utf8 files

Hello there

Does anyone know of a simple way to find out whether a particular field in a file has been left untranslated, just by looking at its characters? Say a field should contain only Chinese, Arabic or Russian words: I'd like to check that this is the case, and that it doesn't still contain only (or any) Western characters.

I can't think of a way of doing this automatically, and sometimes I need to check files that are thousands of lines long.

Cheers.
P.
07-27-2011
James Dayton

Re: Question about utf8 files

I think an easy way to do this could be to check for any character with a code point of 255 or less (plain ASCII stops at 127; 128 to 255 is the Latin-1 supplement). If you want to rule out Latin-script languages completely, you might need to raise the bar a little.
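
A minimal sketch of the ASCII variant of this check, working on bytes rather than code points. It relies on the fact that UTF-8 encodes every non-ASCII character with bytes of 0x80 or above, so a line without such bytes cannot contain Chinese, Arabic or Cyrillic text. The function name `ascii_only_lines` is made up for illustration, and the sketch assumes a reasonably recent GNU grep that treats high bytes as ordinary characters in the C locale. Note that accented Latin letters such as "é" also count as non-ASCII here, and that a line mixing translated and untranslated words will not be reported.

```shell
# Hypothetical helper: print, with line numbers, every line that consists
# purely of ASCII bytes (candidates for "still untranslated").
ascii_only_lines() {
    # LC_ALL=C makes grep work byte by byte; printf builds the byte range
    # 0x80-0xFF from octal escapes; -v inverts the match, so only lines
    # WITHOUT any such byte are printed.
    LC_ALL=C grep -n -v "$(printf '[\200-\377]')" "$@"
}
```

Usage would be something like `ascii_only_lines strings.txt`, where `strings.txt` is a placeholder for a file with one field per line.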

Hugs,
James
08-01-2011
pabloa

Re: Question about utf8 files

Hi James

Thanks for the tip. In the end I settled for this simple solution:

Code:

sed 's/<[^>]*>//g' *.xml | grep -o "[a-zA-Z]*"

That is, I first stripped off the XML tags and then "grepped" for any sequence of letters in the ASCII range. It produces a list of the words that are not in Chinese or Arabic. It wouldn't work well for Western languages, but it did the job for what I needed. I'm sure it can be improved. For some reason, leaving out the "-o" flag makes it malfunction.
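
On the "-o" behaviour: the pattern `[a-zA-Z]*` also matches the empty string, so without "-o" every line counts as a match and grep prints the whole input. With "-o", GNU grep prints only the non-empty matches, which is why the word list appears. Below is a possible variation, offered as a sketch rather than a definitive fix: it requires at least one letter per match and adds a count per word. The function name `latin_words` is hypothetical, "-o" is assumed to be available (it is in GNU and BSD grep, but is not required by POSIX), and the sed step still assumes that no tag spans more than one line.

```shell
# Hypothetical helper: list each run of ASCII letters left in the XML text,
# most frequent first.
latin_words() {
    # 1. sed removes anything that looks like <...> on a single line.
    # 2. grep -o prints every run of one or more ASCII letters on its own
    #    line; LC_ALL=C keeps the a-z and A-Z ranges limited to ASCII.
    # 3. sort | uniq -c counts duplicates; sort -rn orders by count.
    sed 's/<[^>]*>//g' "$@" |
        LC_ALL=C grep -o '[a-zA-Z][a-zA-Z]*' |
        sort | uniq -c | sort -rn
}
```

Called as `latin_words *.xml`, this prints one line per distinct word with its number of occurrences in front.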

Cheers.
P.
08-01-2011
James Dayton

Re: Question about utf8 files

Pretty nice one indeed!

Copyright 2006 - English Spanish Translator