AppleWorks / ClarisWorks

From Wiki.wirelust.com

Overview

ClarisWorks / AppleWorks documents use a closed file format. This page is an attempt to describe the format (or formats) so that documents can be imported into newer, more open formats for archiving.


After being frustrated that I wasn't able to normalize old files with Xena, I set out to figure out how to read this format so that I could develop a plugin. I was sure that someone had already written a plugin for OpenOffice or KOffice or something. This page is a collection of all of the resources and info I have been able to find, as well as my own discoveries about this file format. I plan on continuing work until I have enough knowledge to develop a reader that can at least extract text and some basic formatting from simple documents. From what I can tell, this first goal can be met. If anyone else out there finds anything on their own, please update this wiki or email me your thoughts.

Code and test files can be found at: https://github.com/teacurran/appleworks-parser

Priorities

  • Discover how to determine the start of the content block
  • Figure out the format of DSET
  • Discover how to read the style attributes to apply to the content
  • Develop a plugin for Xena
  • Develop a plugin for general use

Example Files

Please email me any examples you might have, especially if you have a version of ClarisWorks older than 5.0.

File Format

Keywords

There appear to be several keywords:

Keywords
keyword | type | can contain | description | notes
BBAR | | | |
CHAR | | | |
CELL | | | |
CPRT | variable | | | First 4 bytes indicate length of block. v6 contains XML with printing information.
DSET | | | | Appears to have a format like: 4-byte length, value, 4-byte length, value, continuing; not sure when it ends.
DSUM | variable | | Document summary | First 4 bytes indicate length of block.
ETBL | | | |
FNTM | blocked | | Something to do with fonts |
GRPH | | | |
HASH | | | | Appears in multiples of 2? Always preceded by: FF FF 00 00 00 06 00 04 00 01
HDNI | variable | | | First 4 bytes indicate length of block.
KSEN | | | | Preceded by?: FF FF 00 00 00 0E 00 0A 00 02
LKUP | | | | Preceded by?: FF FF 00 00 00 02 00 04 00 02
LOM! | | | | Don't know if this is a keyword, but putting it here just in case.
NAME | | | |
RULR | | | Probably page rulers | Unable to determine the length.
MARK | | MRKS | |
MOBJ | | | | First 4 bytes indicate length of block.
MRKS | | | |
oBIN | | | |
SNAP | variable | | Snapshot | First 4 bytes indicate length of block, then there are 5 bytes that are unknown (probably payload type), then a PICT file. Possibly v6 only.
STYL | | HASH, NAME, FNTM, CELL, GRPH, RULR | | First 4 bytes indicate length of block.
TNAM | | | | Different on every save.
WMBT | | | |
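
Several of the keywords above are noted as starting with a 4-byte length. The following Python sketch shows one way to locate them and read that length. It assumes the length is an unsigned big-endian integer placed immediately after the four-character tag, which this page has not confirmed. The name find_blocks is hypothetical and is not taken from the appleworks-parser code.

```python
import struct

# Tags that the notes above describe as "first 4 bytes indicate length of block".
LENGTH_PREFIXED = (b"CPRT", b"DSUM", b"HDNI", b"MOBJ", b"SNAP", b"STYL")


def find_blocks(data: bytes):
    """Yield (tag, offset, length) for each length-prefixed keyword found.

    Assumption: the length is a big-endian unsigned 32-bit integer stored
    directly after the tag. Neither detail is confirmed.
    """
    for tag in LENGTH_PREFIXED:
        pos = data.find(tag)
        while pos != -1:
            # Only report the tag if 4 length bytes actually follow it.
            if pos + 8 <= len(data):
                (length,) = struct.unpack_from(">I", data, pos + 4)
                yield tag.decode("ascii"), pos, length
            pos = data.find(tag, pos + 1)
```

A plain byte search like this can also match a tag that happens to occur inside document text, so any hit is only a candidate until the length looks plausible for the file.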

Markers

I am guessing that these are markers; I am still trying to figure out the meaning of each.

Markers
marker type can contain description notes observed length v5 observed length v6
0x0000FFFF
0x0001FFFF
0x0101FFFF 68
0x0003FFFF
0x0005FFFF 176 160
0x0007FFFF
0x7FFFFFFF
0x000BFFFF
0x000DFFFF
0x0E01FFFF 80

Document Header

Document Header
chunk id | position start | length (bytes) | description | example | ascii or int | comments
1 | 0 | 1 | major version | 05, 06 | | confirmed
2 | 1 | 3 | additional version | 029900, 07E100 | | appears somewhat random but is specific to minor version, maybe platform
3 | 4 | 4 | creator type | 424F424F | BOBO | always has the same value
4 | 8 | 4 | previous version | 029900, 07E100 | | if the file was converted this will contain the previous major and additional version number; if not converted it will be the same as #1 and #2
5 | 12 | 8 | | 0x00000000 0x00000000 | | seems to always be full of zeros
6 | 20 | 2 | | 0x0001 | | seems to always be 0x0001
7 | 22 | 2 | | 0x0194, 0x01CD | | some sort of marker; it will appear not too far ahead of this block
8 | 24 | 2 | | | | is usually the same after each instance of block, but sometimes different
9 | 26 | 4 | | 0x00000000 | |
10 | 30 | 2 | page height | 792, 612 | | page size in pts, i.e. 792x612 for portrait, 612x792 for landscape
11 | 32 | 2 | page width | | |
12 | 34 | 12 | margins | 0x0048 0x0048 0x0048 0x0048 0x0048 0x0048 | HHHHHH | margins
13 | 46 | 2 | inner height | | | will be equal to #10 minus either right or left margin, not sure which yet
14 | 48 | 2 | inner width | | | will be equal to #11 minus either top or bottom margin, not sure which yet
15 | 50 | 2 | | 0x01 | | same in all files tested
16 | 52 | 2 | | 0x00 | | same in all files tested
17 | 54 | 2 | | 0x01 | | same in all files tested
18 | 56 | 2 | | 0x00 | | same in all files tested
19 | ? | 8 4 | | 0x0005FFFF | |
20 | ? | 4 | end header??? | 7FFFFFFF | | appears in all files tested; position: 680 in 5.0v1, 672 in 6.2.9
21 | after last block | 4 | length of next block after next | | |
22 | after last | 46 | unknown | | |
23 | after last | determined by number in #21 | unknown | | |
  • There is a 2-byte delimiter shortly after the header that is used throughout the document.
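
The fixed-offset part of the header can be decoded with a few struct calls. The sketch below is an illustration only: it assumes big-endian integers, and it assumes the version bytes occupy offsets 0-3 with the creator tag at offsets 4-7 (directly before the previous-version field at offset 8). The function name parse_header and the dictionary keys are hypothetical.

```python
import struct


def parse_header(data: bytes) -> dict:
    """Decode the header fields that have a known offset in the table above.

    Assumptions: big-endian values; creator tag at bytes 4-7.
    """
    if len(data) < 50:
        raise ValueError("need at least 50 bytes of header")
    page_height, page_width = struct.unpack_from(">HH", data, 30)
    margins = struct.unpack_from(">6H", data, 34)  # six 2-byte values
    inner_height, inner_width = struct.unpack_from(">HH", data, 46)
    return {
        "major_version": data[0],
        "additional_version": data[1:4].hex(),
        "creator": data[4:8].decode("ascii", "replace"),
        "previous_version": data[8:12].hex(),
        "page_height": page_height,
        "page_width": page_width,
        "margins": margins,
        "inner_height": inner_height,
        "inner_width": inner_width,
    }
```

With a real file this would be called as parse_header(open(path, "rb").read(64)), where path is whatever .cwk file is being inspected.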

Document Info

  • There is a summary stored after the main header but before the first DSET.
desc | length (bytes) | notes
full length + 1 | 4 |
abbreviated length | 1 |
  • This is used to store an abbreviated table of properties for:
    • Title
    • Author
    • Version
    • Keywords
    • Category
    • Description
  • Each field is allowed 255 bytes of content.
  • Full content is always available in the DSUM section.

Document Content

Content appears to start right after the end of the first DSET block.

Strings in the document start with 4 bytes indicating the length of the string.

The content area will have several strings in a row without any termination.

The last string appears to be null terminated.

  • Footnotes show up in the text as 0x02.
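
Reading a run of such strings might look like the following sketch. It assumes each length is a 4-byte big-endian integer, and it simply stops at the first length that is zero or that would run past the end of the data, since the real end-of-content rule is not yet known. The name read_strings is hypothetical.

```python
import struct


def read_strings(data: bytes, offset: int = 0):
    """Return (strings, next_offset) for consecutive length-prefixed strings.

    Assumption: 4-byte big-endian length followed by that many bytes.
    Stops when a length is zero or does not fit in the remaining data.
    """
    strings = []
    while offset + 4 <= len(data):
        (length,) = struct.unpack_from(">I", data, offset)
        end = offset + 4 + length
        if length == 0 or end > len(data):
            break
        strings.append(data[offset + 4:end])
        offset = end
    return strings, offset
```

The strings are returned as raw bytes. If the text turns out to be Mac Roman (an assumption, not something verified here), each one could be decoded with raw.decode("mac_roman"), and footnote positions could be found by looking for the byte 0x02.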

Document TOC

The TOC can contain any number of markers in any order. The data area always starts and ends with ETBL.

Document TOC - at end of file
position start | length (bytes) | description | example | ascii | comments
start position determined by other ETBL | 4 | tag | 4554424C | ETBL | Value indicates the total length of data in ETBL
anywhere | 4 | data | | oBIN | oBIN block offset from start of doc
anywhere | 4 | tag | 4453554D | DSUM | DSUM block offset
anywhere | 4 | data | | STYL | STYL block offset
anywhere | 4 | data | | BBAR |
anywhere | 4 | data | | MARK | MARK block offset
anywhere | 4 | data | | MRKS |
EOF - 24 | 4 | tag | 4554424C | ETBL | Following value indicates start position
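
One reading of the table above is: find the ETBL tag 24 bytes before the end of the file, follow the value after it back to the first ETBL, then walk 8-byte entries made of a 4-byte tag and a 4-byte offset. The sketch below implements that reading and nothing more. The big-endian byte order, the 8-byte entry size, and the idea that the length after the first ETBL counts only entry bytes are all assumptions, and read_toc is a hypothetical name.

```python
import struct


def read_toc(data: bytes) -> dict:
    """Return {tag: [offsets]} from the TOC at the end of the file.

    Assumptions: big-endian 32-bit values; the value after the trailing
    ETBL is the absolute offset of the first ETBL; entries are tag + offset.
    """
    tail = len(data) - 24
    if tail < 0 or data[tail:tail + 4] != b"ETBL":
        raise ValueError("no ETBL tag 24 bytes before end of file")
    (start,) = struct.unpack_from(">I", data, tail + 4)
    if data[start:start + 4] != b"ETBL":
        raise ValueError("trailing ETBL does not point at another ETBL")
    (length,) = struct.unpack_from(">I", data, start + 4)
    entries = {}
    pos = start + 8
    end = min(pos + length, tail)  # never read into the trailing ETBL
    while pos + 8 <= end:
        tag = data[pos:pos + 4].decode("ascii", "replace")
        (offset,) = struct.unpack_from(">I", data, pos + 4)
        entries.setdefault(tag, []).append(offset)
        pos += 8
    return entries
```

Offsets are kept in lists because the same marker may appear more than once.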

Misc

In both versions tested, the document ends with:
FF FE FD FC FB FA F9 F8 F0 F1 F2 F3 F4 F5 F6 F7
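
A reader could use these 16 bytes as a quick sanity check before attempting to parse a file. This is a minimal sketch; has_trailer is a hypothetical name, and the check has only been observed to hold for the two versions mentioned above.

```python
# The 16 trailing bytes observed at the end of the files tested.
TRAILER = bytes.fromhex("FF FE FD FC FB FA F9 F8 F0 F1 F2 F3 F4 F5 F6 F7")


def has_trailer(data: bytes) -> bool:
    """Return True if the data ends with the observed 16-byte trailer."""
    return data.endswith(TRAILER)
```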

Passwords

  • Password-protected documents do not have their content protected.
  • The password is not stored in the file.
  • The file probably stores a checksum, because password length makes little difference to what is stored.

Other Elements

Other Efforts

AbiWord

OpenOffice / StarOffice

If you do a ton of Google searches, you find a lot of pages that say StarOffice could open ClarisWorks documents. This was done with the W4W filter. After a lot of digging, I believe that these filters live in OpenOffice in the Framework project. After checking out the source for the Framework project, I believe that the ClarisWorks import support was nonexistent. If I am reading the source correctly, it looks like this filter simply opens the document as ASCII. If this is the case, I don't know why they even bothered to say they had a filter. If this is not the case, someone please correct me.

Proprietary

  • Old versions of DataViz can convert documents. The product appears dead but is still for sale.
  • MacText can convert older .cwk files to RTF, Word 2, and WordPerfect.
  • XTND - there is a lot of info out there about XTND filters as part of System 6 and 7. I would like to investigate whether copies of these filters could help this effort, but I haven't been able to find enough info yet.

Misc
