AppleWorks / ClarisWorks
From Wiki.wirelust.com
Overview
ClarisWorks / AppleWorks documents use a closed, undocumented file format. This page is an attempt to describe that format (or formats) so that documents can be imported into newer, more open formats for archiving.
After being frustrated that I wasn't able to normalize old files with Xena, I set out to figure out how to read this format so I could develop a plugin. I was sure that someone had already written a plugin for OpenOffice, KOffice, or something similar. This page is a collection of all the resources and info I have been able to find, as well as my own discoveries about the file format. I plan to continue working until I know enough to develop a reader that can at least extract the text and some basic formatting from simple documents. From what I can tell, this first goal can be met. If anyone else out there finds anything on their own, please update this wiki or email me your thoughts.
Code and test files can be found at: https://github.com/teacurran/appleworks-parser
Priorities
- Discover how to determine the start of the content block
- Figure out the format of DSET
- Discover how to read the style attributes to apply to the content
- Develop a plugin for Xena
- Develop a plugin for general use
Example Files
Please email me any examples you might have, especially if you have a version of ClarisWorks older than 5.0.
File Format
Keywords
The files appear to contain several four-character keywords:
| keyword | type | can contain | description | notes |
|---|---|---|---|---|
| BBAR | ||||
| CHAR | ||||
| CELL | ||||
| CPRT | variable | First 4 bytes indicate length of block. In v6, contains XML with printing information. | ||
| DSET | Appears to be a sequence of (4-byte length, value) pairs repeated back to back; it is not yet clear where the sequence ends. | |||
| DSUM | variable | Document summary | First 4 bytes indicate length of block | |
| ETBL | ||||
| FNTM | blocked | something to do with fonts | ||
| GRPH | ||||
| HASH | Appears in multiples of 2? Always preceded by: FF FF 00 00 00 06 00 04 00 01 | |||
| HDNI | variable | First 4 bytes indicate length of block | ||
| KSEN | Preceded by (unconfirmed): FF FF 00 00 00 0E 00 0A 00 02 | |||
| LKUP | Preceded by (unconfirmed): FF FF 00 00 00 02 00 04 00 02 | |||
| LOM! | Not sure whether this is a keyword; listed here just in case | |||
| NAME | ||||
| RULR | Probably page rulers; unable to determine the length | |||
| MARK | MRKS MOBJ | First 4 bytes indicate length of block | ||
| MRKS | ||||
| oBIN | ||||
| SNAP | variable | snapshot | First 4 bytes indicate length of block, followed by 5 unknown bytes (probably the payload type), then a PICT file. Possibly v6 only. | |
| STYL | HASH NAME FNTM CELL GRPH RULR | First 4 bytes indicate length of block | ||
| TNAM | Different on every save | |||
| WMBT |
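As a starting point for exploring these keywords, here is a minimal Python sketch that scans a file for the keywords whose notes say "first 4 bytes indicate length of block" and reads that length. It assumes the length is a big-endian unsigned integer stored directly after the keyword, which this page has not confirmed. The function name `find_keyword_blocks` is made up for illustration.

```python
import struct

# Keywords from the table above whose notes say
# "first 4 bytes indicate length of block".
LENGTH_PREFIXED = (b"CPRT", b"DSUM", b"HDNI", b"MARK", b"SNAP", b"STYL")


def find_keyword_blocks(data, keywords=LENGTH_PREFIXED):
    """Return a sorted list of (offset, keyword, length) tuples.

    Assumption: the 4-byte length follows the keyword directly and is a
    big-endian unsigned integer. A hit may be a false positive if the
    same four bytes happen to occur inside unrelated data.
    """
    hits = []
    for kw in keywords:
        pos = data.find(kw)
        while pos != -1:
            if pos + 8 <= len(data):
                (length,) = struct.unpack_from(">I", data, pos + 4)
                hits.append((pos, kw.decode("ascii"), length))
            pos = data.find(kw, pos + 1)
    return sorted(hits)
```

Calling `find_keyword_blocks(open(path, "rb").read())` on a sample file gives a quick map of where each block sits; implausibly large lengths are a hint that a hit is a false positive or that the assumption about the length field is wrong.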
Markers
My guess is that the following values are markers. I am still trying to figure out the meaning of each.
| marker | type | can contain | description | notes | observed length v5 | observed length v6 |
|---|---|---|---|---|---|---|
| 0x0000FFFF | ||||||
| 0x0001FFFF | ||||||
| 0x0101FFFF | 68 | |||||
| 0x0003FFFF | ||||||
| 0x0005FFFF | 176 | 160 | ||||
| 0x0007FFFF | ||||||
| 0x7FFFFFFF | ||||||
| 0x000BFFFF | ||||||
| 0x000DFFFF | ||||||
| 0x0E01FFFF | 80 |
Document Header
| chunk id | position start | length (bytes) | description | example | ascii or int | comments |
|---|---|---|---|---|---|---|
| 1 | 0 | 1 | major version | 05 06 | confirmed | |
| 2 | 1 | 3 | additional version | 029900 07E100 | appears somewhat random but is specific to minor version, maybe platform | |
| 3 | 4 | 4 | creator type | 424F424F | BOBO | Always has the same value |
| 4 | 8 | 4 | previous version | 029900 07E100 | If the file was converted, this contains the previous major and additional version number. If not converted, it is the same as the version bytes at the start of the file. | |
| 5 | 12 | 8 | 0x00000000 0x00000000 | seems to always be full of zeros | ||
| 6 | 20 | 2 | 0x0001 | seems to always be 0x0001 | ||
| 7 | 22 | 2 | 0x0194 0x01CD | some sort of marker - will appear not too far ahead of this block. | ||
| 8 | 24 | 2 | is usually the same after each instance of block, but sometimes different. | |||
| 9 | 26 | 4 | 0x00000000 | |||
| 10 | 30 | 2 | page height | 792 612 | page height, then width (#11), in points, e.g. 792x612 for portrait, 612x792 for landscape | |
| 11 | 32 | 2 | page width | |||
| 12 | 34 | 12 | margins | 0x0048 0x0048 0x0048 0x0048 0x0048 0x0048 | HHHHHH | margins |
| 13 | 46 | 2 | inner height | will be equal to #10 minus either right or left, not sure which yet | ||
| 14 | 48 | 2 | inner width | will be equal to #11 minus either top or bottom margin, not sure which yet | ||
| 15 | 50 | 2 | 0x01 | same in all files tested | ||
| 16 | 52 | 2 | 0x00 | same in all files tested | ||
| 17 | 54 | 2 | 0x01 | same in all files tested | ||
| 18 | 56 | 2 | 0x00 | same in all files tested | ||
| 19 | ? | 8 | 4 | 0x0005FFFF | ||
| 20 | ? | 4 | end header??? | 7FFFFFFF | appears in all files tested. position: 680 - 5.0v1 672 - 6.2.9 | |
| 21 | after last block | 4 | length of next block after next | |||
| 22 | after last | 46 | unknown | |||
| 23 | after last | determined by number in #21 | unknown |
- There is a 2-byte delimiter shortly after the header that is used throughout the document.
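The fixed-position part of the header can be decoded with a few `struct` calls. The sketch below is only an illustration under stated assumptions: all integers are taken to be big-endian, the creator type `BOBO` is assumed to sit at byte 4 (directly after the four version bytes), and the remaining offsets are the ones tabulated above. `parse_header` and the dictionary keys are hypothetical names.

```python
import struct


def parse_header(data):
    """Decode the first 50 bytes of a ClarisWorks/AppleWorks file.

    Assumptions: big-endian integers; creator type at byte 4; page size
    at byte 30, six 2-byte margins at byte 34, inner size at byte 46.
    """
    if len(data) < 50:
        raise ValueError("need at least 50 bytes of header")
    if data[4:8] != b"BOBO":
        raise ValueError("creator type BOBO not found at byte 4")
    page_height, page_width = struct.unpack_from(">HH", data, 30)
    margins = struct.unpack_from(">6H", data, 34)
    inner_height, inner_width = struct.unpack_from(">HH", data, 46)
    return {
        "major_version": data[0],
        "additional_version": data[1:4].hex(),
        "previous_version": data[8:12].hex(),
        "page_height": page_height,
        "page_width": page_width,
        "margins": margins,  # order of the six values is not yet known
        "inner_height": inner_height,
        "inner_width": inner_width,
    }
```

Comparing `inner_height` and `inner_width` against the page size and margins across several files may help settle which margins are subtracted in chunks #13 and #14.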
Document Info
- There is a summary stored after the main header but before the first DSET.
| desc | length (bytes) | notes |
|---|---|---|
| full length + 1 | 4 | |
| abbreviated length | 1 |
- This is used to store an abbreviated table of properties for:
- Title
- Author
- Version
- Keywords
- Category
- Description
- Each field is allowed 255 bytes of content
- Full content is always available in the DSUM section
Document Content
Content appears to start right after the end of the first DSET block.
Each string in the document starts with 4 bytes giving the length of the string.
The content area holds several strings in a row with no terminator between them.
The last string appears to be null-terminated.
- Footnotes show up in the text as the byte 0x02
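The string layout described above can be read with a short loop. This is a sketch, not a tested reader: it assumes the 4-byte length is big-endian, that the text is MacRoman (a guess based on the format's Mac origin), and that a run of strings ends at a zero length or at a length that would run past the end of the data. `read_string_run` is a hypothetical helper, and the caller still has to find the starting offset (the end of the first DSET block).

```python
import struct


def read_string_run(data, offset, max_strings=1000):
    """Read consecutive length-prefixed strings starting at `offset`.

    Returns (strings, next_offset). Assumptions: big-endian 4-byte
    length, MacRoman text, run ends at a zero or out-of-range length.
    """
    strings = []
    while len(strings) < max_strings and offset + 4 <= len(data):
        (length,) = struct.unpack_from(">I", data, offset)
        start = offset + 4
        if length == 0 or start + length > len(data):
            break
        raw = data[start:start + length]
        # The last string appears to be null-terminated, so drop any
        # trailing NUL bytes before decoding.
        strings.append(raw.rstrip(b"\x00").decode("mac_roman"))
        offset = start + length
    return strings, offset
```

Footnote references stay in the decoded text as the character `\x02`, so a caller can split on it or replace it as needed.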
Document TOC
The TOC can contain any number of markers in any order. The data area always starts and ends with ETBL.
| position start | length (bytes) | description | example | ascii | comments |
|---|---|---|---|---|---|
| start position determined by other ETBL | 4 | tag | 4554424C | ETBL | Value indicates the total length of data in ETBL |
| anywhere | 4 | data | oBIN | oBIN block offset from start of doc | |
| anywhere | 4 | tag | 4453554D | DSUM | DSUM block offset |
| anywhere | 4 | data | STYL | STYL block offset | |
| anywhere | 4 | data | BBAR | ||
| anywhere | 4 | data | MARK | MARK block offset | |
| anywhere | 4 | data | MRKS | ||
| EOF - 24 | 4 | tag | 4554424C | ETBL | Following Value indicates start position |
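One possible reading of the table above, sketched in Python: the `ETBL` tag 24 bytes before the end of the file is followed by a 4-byte offset to the start of the TOC, and the TOC itself is an `ETBL` tag, a 4-byte length, and then 8-byte entries made of a 4-byte tag and a 4-byte block offset. The entry layout, the meaning of the length field, and big-endian byte order are all assumptions, and `read_toc` is a made-up name.

```python
import struct


def read_toc(data):
    """Return a list of (tag, offset) pairs from the ETBL table.

    Assumptions: big-endian integers; the trailing ETBL tag is followed
    by the TOC start offset; the TOC is "ETBL", a 4-byte length of the
    entry data, then (4-byte tag, 4-byte offset) pairs.
    """
    tail = len(data) - 24
    if tail < 0 or data[tail:tail + 4] != b"ETBL":
        raise ValueError("no ETBL tag 24 bytes before end of file")
    (start,) = struct.unpack_from(">I", data, tail + 4)
    if start + 8 > len(data) or data[start:start + 4] != b"ETBL":
        raise ValueError("trailing offset does not point at an ETBL tag")
    (length,) = struct.unpack_from(">I", data, start + 4)
    pos = start + 8
    end = min(pos + length, tail)  # never read into the trailing ETBL
    entries = []
    while pos + 8 <= end:
        tag = data[pos:pos + 4].decode("ascii", errors="replace")
        (offset,) = struct.unpack_from(">I", data, pos + 4)
        entries.append((tag, offset))
        pos += 8
    return entries
```

If the offsets returned for DSUM, STYL, or MARK land exactly on those keywords in a real file, that would support this reading; if they do not, the entry layout assumed here is wrong.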
Misc
In both versions tested, the document ends with:
FF FE FD FC FB FA F9 F8 F0 F1 F2 F3 F4 F5 F6 F7
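This trailer makes for a cheap sanity check before attempting to parse a file. A minimal sketch in Python (the names `TRAILER` and `has_trailer` are illustrative only, and the check has only been motivated by the two versions mentioned above):

```python
# The 16 bytes observed at the end of the files tested so far.
TRAILER = bytes.fromhex("FF FE FD FC FB FA F9 F8 F0 F1 F2 F3 F4 F5 F6 F7")


def has_trailer(data):
    """Return True if `data` ends with the 16-byte trailer shown above."""
    return data.endswith(TRAILER)
```

A file that lacks the trailer may be truncated, or may come from a version that has not been examined yet.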
Passwords
- Password-protected documents do not have their content protected.
- The password itself is not stored in the file.
- The file probably stores a checksum of the password, because what is stored does not seem to vary much with password length.
Other Elements
Other Efforts
AbiWord
- ClarisWorks import for AbiWord (non-functional but has some info)
OpenOffice / StarOffice
If you do a ton of Google searches, you find a lot of pages saying that StarOffice could open ClarisWorks documents. This was done with the W4W filter. After a lot of digging, I believe that these filters live in OpenOffice in the Framework project. After checking out the source for the Framework project, I believe that ClarisWorks import support was non-existent. If I am reading the source correctly, the filter simply opens the document as ASCII. If that is the case, I don't know why they even bothered to say they had a filter; if it is not the case, someone please correct me.
Proprietary
- Old versions of DataViz can convert documents. The product appears dead but is still for sale.
- MacText can convert older .cwk files to RTF, Word 2, and WordPerfect
- XTND: there is a lot of info out there about XTND filters as part of System 6 and 7. I would like to investigate whether copies of these filters could help this effort, but I haven't been able to find enough info yet.
Misc
- Forum message from someone looking into the format (from 2001, ha!)

