Thread: TTX structure

 
  1. #1
    Contributing User
    Join Date
    May 2011
    Posts
    166
    Rep Power
    103

    TTX structure

    Hi there

    I'm starting to have a look at .ttx files created with Trados. To start with, they have the same internal structure: a null byte after each ASCII character (the first hexdump below shows 3c 00, 3f 00, and so on). I've found a better way of getting rid of the null bytes (better than the sed script I posted in another thread):

    Code:
    pabloa:~/Development$ hexdump -C test_all.xml.ttx | head
    00000000  ff fe 3c 00 3f 00 78 00  6d 00 6c 00 20 00 76 00  |..<.?.x.m.l. .v.|
    00000010  65 00 72 00 73 00 69 00  6f 00 6e 00 3d 00 27 00  |e.r.s.i.o.n.=.'.|
    00000020  31 00 2e 00 30 00 27 00  3f 00 3e 00 0d 00 0a 00  |1...0.'.?.>.....|
    00000030  3c 00 54 00 52 00 41 00  44 00 4f 00 53 00 74 00  |<.T.R.A.D.O.S.t.|
    00000040  61 00 67 00 20 00 56 00  65 00 72 00 73 00 69 00  |a.g. .V.e.r.s.i.|
    00000050  6f 00 6e 00 3d 00 22 00  32 00 2e 00 30 00 22 00  |o.n.=.".2...0.".|
    00000060  3e 00 3c 00 46 00 72 00  6f 00 6e 00 74 00 4d 00  |>.<.F.r.o.n.t.M.|
    00000070  61 00 74 00 74 00 65 00  72 00 3e 00 3c 00 54 00  |a.t.t.e.r.>.<.T.|
    00000080  6f 00 6f 00 6c 00 53 00  65 00 74 00 74 00 69 00  |o.o.l.S.e.t.t.i.|
    00000090  6e 00 67 00 73 00 20 00  43 00 72 00 65 00 61 00  |n.g.s. .C.r.e.a.|
    pabloa:~/Development$ cat test_all.xml.ttx | tr -d "\0" >test.clean
    pabloa:~/Development$ hexdump -C test.clean | head
    00000000  ff fe 3c 3f 78 6d 6c 20  76 65 72 73 69 6f 6e 3d  |..<?xml version=|
    00000010  27 31 2e 30 27 3f 3e 0d  0a 3c 54 52 41 44 4f 53  |'1.0'?>..<TRADOS|
    00000020  74 61 67 20 56 65 72 73  69 6f 6e 3d 22 32 2e 30  |tag Version="2.0|
    00000030  22 3e 3c 46 72 6f 6e 74  4d 61 74 74 65 72 3e 3c  |"><FrontMatter><|
    00000040  54 6f 6f 6c 53 65 74 74  69 6e 67 73 20 43 72 65  |ToolSettings Cre|
    00000050  61 74 69 6f 6e 44 61 74  65 3d 22 32 30 31 31 30  |ationDate="20110|
    00000060  36 31 30 54 31 38 34 33  35 38 5a 22 20 43 72 65  |610T184358Z" Cre|
    00000070  61 74 69 6f 6e 54 6f 6f  6c 3d 22 53 44 4c 20 54  |ationTool="SDL T|
    00000080  52 41 44 4f 53 20 54 61  67 45 64 69 74 6f 72 22  |RADOS TagEditor"|
    00000090  20 43 72 65 61 74 69 6f  6e 54 6f 6f 6c 56 65 72  | CreationToolVer|
    Except for the first two bytes (ff fe), which should be stripped off, the result is an XML file. I wonder whether anyone has insights into the inner structure of this XML file. In the meantime I'll keep posting my findings.

    Cheers.
    P.

  2. #2

    Re: TTX structure

    The method suggested above is incorrect and causes all kinds of havoc with the files. These TTX files are encoded in UTF-16 (the leading ff fe bytes are the little-endian byte order mark), and that's the reason for the strange-looking content. The proper way of dealing with this is to convert the file to UTF-8, after which everything works:

    Code:
    pabloa:~/Development$ iconv -f utf16 -t utf8 test_all.xml.ttx
    Sorry for the mistake.
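    For anyone who wants to see the difference between the two approaches, here is a minimal Python sketch. The sample segment ("Señor") is made up for illustration and does not come from the file above. It shows that deleting null bytes leaves invalid UTF-8 as soon as a non-ASCII character appears, whereas decoding as UTF-16 recovers the text and consumes the ff fe marker.

```python
# Hypothetical segment containing one non-ASCII letter (not taken from the real file).
sample_text = '<Tuv Lang="ES-EM">Señor</Tuv>'

# Bytes as they would sit in a .ttx file: BOM (ff fe) followed by UTF-16 little-endian.
ttx_bytes = b"\xff\xfe" + sample_text.encode("utf-16-le")

# The tr -d "\0" approach: drop every null byte, then drop the first two bytes.
stripped = ttx_bytes.replace(b"\x00", b"")[2:]
try:
    stripped.decode("utf-8")
    strip_ok = True
except UnicodeDecodeError:
    # The byte f1 left over from "ñ" is not a valid UTF-8 sequence here.
    strip_ok = False

# The iconv approach: decode as UTF-16 (the BOM selects the byte order and is consumed),
# then re-encode as UTF-8.
decoded = ttx_bytes.decode("utf-16")
utf8_bytes = decoded.encode("utf-8")

print(strip_ok, decoded)
```

    With pure ASCII content both approaches happen to give the same bytes, which is why the first attempt looked fine on the header of the file.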

    Cheers.
    P.

  3. #3

    Re: TTX structure

    I'll carry on with the analysis of a TTX file. As an example, here is a short file:

    Code:
    <?xml version='1.0'?>
    <TRADOStag Version="2.0">
      <FrontMatter>
        <ToolSettings [some tools setting]/>
        <UserSettings [some user settings]/>
      </FrontMatter>
      <Body>
        <Raw>
          <ut Class="procinstr" DisplayText="Instruction">&lt;?xml version=&quot;1.0&quot; encoding=&quot;utf-8&quot;?&gt;</ut>
          <ut Type="start" Style="external" RightEdge="angle" DisplayText="root_tag">&lt;root_tag source-language=&quot;en&quot; version=&quot;1.0&quot; date-created=&quot;2011-06-14T16:04:21.734945&quot;&gt;</ut>
          <ut Type="start" Style="external" RightEdge="angle" DisplayText="field">&lt;field type=&quot;TextField&quot; name=&quot;title&quot;&gt;</ut>
          <Tu Origin="manual" MatchPercent="99">
            <Tuv Lang="EN-US">To be or not to be</Tuv>
            <Tuv Lang="ES-EM">Ser o no ser</Tuv>
          </Tu>
          <ut Type="end" Style="external" LeftEdge="angle" DisplayText="field">&lt;/field&gt;</ut>
          <ut Type="end" Style="external" LeftEdge="angle" DisplayText="root_tag">&lt;/root_tag&gt;</ut>
        </Raw>
      </Body>
    </TRADOStag>
    The root tag is TRADOStag. It contains a FrontMatter element grouping the settings, and a Body element for the actual content in the source and target languages. The general structure can be guessed at easily, but we'll make it explicit in later posts.
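    As a rough sketch of how this layout can be inspected, the following Python snippet parses a cut-down copy of the example with the standard library's xml.etree.ElementTree. The settings placeholders are replaced by empty elements so that the sample is well-formed, which is an assumption about what a real file would allow.

```python
import xml.etree.ElementTree as ET

# Cut-down copy of the example file; the settings attributes are omitted.
ttx_text = """<TRADOStag Version="2.0">
  <FrontMatter>
    <ToolSettings/>
    <UserSettings/>
  </FrontMatter>
  <Body>
    <Raw>
      <ut Type="start" Style="external" RightEdge="angle" DisplayText="field">&lt;field type=&quot;TextField&quot; name=&quot;title&quot;&gt;</ut>
      <Tu Origin="manual" MatchPercent="99">
        <Tuv Lang="EN-US">To be or not to be</Tuv>
        <Tuv Lang="ES-EM">Ser o no ser</Tuv>
      </Tu>
      <ut Type="end" Style="external" LeftEdge="angle" DisplayText="field">&lt;/field&gt;</ut>
    </Raw>
  </Body>
</TRADOStag>"""

root = ET.fromstring(ttx_text)

# Direct children of the root: the settings group and the content group.
top_level = [child.tag for child in root]

# Direct children of Body/Raw: the "ut" and "Tu" elements, in document order.
raw = root.find("Body/Raw")
raw_children = [child.tag for child in raw]

print(root.tag, top_level, raw_children)
```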

    Cheers.
    P.
    Last edited by pabloa; 06-22-2011 at 11:52 AM.

  4. #4

    Re: TTX structure

    Ok, let's carry on...

    From this XML file, we are only interested in what is inside the "Raw" tag. Within it there are two tags relevant to what we want: "ut" and "Tu" ("Tu" tags have further internal structure, but we will look at that next time).

    Roughly, the information in these tags can be thought of like this: "ut" carries the underlying XML structure (don't forget that this is an XML file with instructions on how to build another XML file), while "Tu" carries the content of the tags specified by "ut", together with translation details such as source and target language, match percentage, etc.

    Take, for example, the first "ut" tag:
    Code:
    <ut Class="procinstr" DisplayText="Instruction">&lt;?xml version=&quot;1.0&quot; encoding=&quot;utf-8&quot;?&gt;</ut>
    If we strip off the opening and closing "ut" tags, and then replace the XML entities (&lt; = < ; &gt; = > ; &quot; = ") we are left with

    Code:
    <?xml version="1.0" encoding="utf-8"?>
    which is the first line we need in a well-formed XML file.
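    The entity replacement can be sketched in Python with unescape() from the standard library's xml.sax.saxutils. By default it only handles &lt;, &gt; and &amp;, so &quot; is passed in explicitly. The variable names are just for illustration.

```python
from xml.sax.saxutils import unescape

# Content of the first "ut" tag, with the opening and closing tags already stripped.
ut_content = "&lt;?xml version=&quot;1.0&quot; encoding=&quot;utf-8&quot;?&gt;"

# unescape() covers &lt;, &gt; and &amp;; the extra mapping adds &quot;.
restored = unescape(ut_content, {"&quot;": '"'})

print(restored)
```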

    We'll carry on next time.

    Cheers.
    P.

  5. #5

    Re: TTX structure

    Next line:
    Code:
    <ut Type="start" Style="external" RightEdge="angle" DisplayText="root_tag">
       &lt;root_tag source-language=&quot;en&quot; version=&quot;1.0&quot; date-created=&quot;2011-06-14T16:04:21.734945&quot;&gt;
    </ut>
    Again, stripping off the "ut" tag and replacing the XML entities in its content, we are left with

    Code:
    <root_tag source-language="en" version="1.0" date-created="2011-06-14T16:04:21.734945">
    Does this look too easy? That's because it is! Note that we are not using the information from the attributes of the "ut" tag: Style, RightEdge, not even DisplayText. That information is redundant for rebuilding the underlying XML.
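    An alternative to replacing the entities by hand is to let an XML parser do it. In this Python sketch (standard library only), the parser resolves the entities while reading the "ut" element, and the attributes are simply never looked at. The strip() call removes the line breaks and indentation around the content as laid out above.

```python
import xml.etree.ElementTree as ET

# The "ut" element exactly as laid out in this post, whitespace included.
ut_xml = """<ut Type="start" Style="external" RightEdge="angle" DisplayText="root_tag">
   &lt;root_tag source-language=&quot;en&quot; version=&quot;1.0&quot; date-created=&quot;2011-06-14T16:04:21.734945&quot;&gt;
</ut>"""

ut = ET.fromstring(ut_xml)

# The parser has already turned &lt; &gt; &quot; into < > " in ut.text.
root_open_tag = ut.text.strip()

print(root_open_tag)
```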

    Cheers.
    P.
    Last edited by pabloa; 06-30-2011 at 11:11 AM.

  6. #6

    Re: TTX structure

    Hi there

    Moving on, the next "ut" tag is similar to the previous one, but with "field" instead of "root_tag". Then we have a "Tu", which marks the start of a translation unit. Within this tag there are two "Tuv" tags, each with a Lang attribute giving the language of its text segment. We don't need the attributes of the "Tu" tag (Origin and MatchPercent), so it's safe to discard them. Depending on which language we are extracting, we keep the segment we need.

    The last two "ut" tags are the closing tags for "field" and "root_tag" respectively.

    Summing up, then, after all these transformations we are left with the following XML file:

    Code:
    
    <?xml version="1.0" encoding="utf-8"?>
    <root_tag source-language="en" version="1.0" date-created="2011-06-14T16:04:21.734945">
      <field type="TextField" name="title">
        Ser o no ser
      </field>
    </root_tag>
    And that's it! There are a couple of things to be aware of when doing these transformations, no big deal, so I'll leave them for next time.
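    Putting the steps from this thread together, a minimal Python sketch of the whole transformation could look like this. The function name ttx_to_xml is made up, and the sketch assumes that only "ut" and "Tu" elements inside "Raw" matter and that any text between them can be ignored, which holds for this small example but may not for real files. The output comes out on a single line rather than indented as above.

```python
import xml.etree.ElementTree as ET

SAMPLE_TTX = """<TRADOStag Version="2.0">
  <FrontMatter/>
  <Body>
    <Raw>
      <ut Class="procinstr" DisplayText="Instruction">&lt;?xml version=&quot;1.0&quot; encoding=&quot;utf-8&quot;?&gt;</ut>
      <ut Type="start" Style="external" RightEdge="angle" DisplayText="root_tag">&lt;root_tag source-language=&quot;en&quot; version=&quot;1.0&quot; date-created=&quot;2011-06-14T16:04:21.734945&quot;&gt;</ut>
      <ut Type="start" Style="external" RightEdge="angle" DisplayText="field">&lt;field type=&quot;TextField&quot; name=&quot;title&quot;&gt;</ut>
      <Tu Origin="manual" MatchPercent="99">
        <Tuv Lang="EN-US">To be or not to be</Tuv>
        <Tuv Lang="ES-EM">Ser o no ser</Tuv>
      </Tu>
      <ut Type="end" Style="external" LeftEdge="angle" DisplayText="field">&lt;/field&gt;</ut>
      <ut Type="end" Style="external" LeftEdge="angle" DisplayText="root_tag">&lt;/root_tag&gt;</ut>
    </Raw>
  </Body>
</TRADOStag>"""


def ttx_to_xml(ttx_text, lang):
    """Rebuild the underlying XML, keeping the Tuv segment whose Lang equals lang."""
    raw = ET.fromstring(ttx_text).find("Body/Raw")
    pieces = []
    for node in raw:
        if node.tag == "ut":
            # The parser has already resolved the entities in the tag content.
            pieces.append((node.text or "").strip())
        elif node.tag == "Tu":
            # Origin and MatchPercent are ignored; only the chosen language is kept.
            for tuv in node.findall("Tuv"):
                if tuv.get("Lang") == lang:
                    pieces.append("".join(tuv.itertext()))
    return "".join(pieces)


result = ttx_to_xml(SAMPLE_TTX, "ES-EM")
print(result)

# Sanity check: the rebuilt document parses as XML again.
rebuilt = ET.fromstring(result.encode("utf-8"))
```

    Calling ttx_to_xml(SAMPLE_TTX, "EN-US") instead keeps the source-language segment.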

    Cheers.
    P.
    Last edited by pabloa; 07-01-2011 at 10:42 AM.
