Copyright © 2003 W3C® (MIT, INRIA, Keio), All Rights Reserved. W3C liability, trademark, document use, and software licensing rules apply.
One of the most persistently annoying issues in data management is keeping metadata with the data it describes. The most difficult (and important) sort of data to track is the "format" (encoding and media type) of files. There are a variety of platform-specific ways to solve parts of the problem (file extensions, filesystem attributes, shebang lines) but none of them survive the various mechanisms for transmitting data entities, from FTP to HTTP to Jabber.
XML has demonstrated the wide applicability of a solution: transmit the metadata as part of the same stream as the data. Furthermore, XML defines (explicitly and implicitly) a bootstrapping process whereby you can detect the fact that the data is XML through its XML declaration, its XML version through its version declaration, its encoding through its encoding declaration and its vocabulary through a DOCTYPE or namespace declaration. This series of bootstraps has been wildly successful. With XML 1.1, it is possible for a PalmOS-based XML parser to reliably detect and decode an SVG document encoded in EBCDIC. The Data Description Header is a compact way to allow non-XML files to be similarly self-describing.
1 Overview
2 Rationale
2.1 The File Type Problem
2.2 The Text Encoding Problem
2.3 The Metadata Problem
3 Normative Definitions
4 Backwards Compatibility Issues
4.1 The Extended Header
4.2 Examples
A References (Non-Normative)
The Data Description Header (DDH) is a standard way for files of all types to declare their format and (if appropriate) their Unicode encoding. It is designed to replace a variety of heuristics such as file extensions, 4-byte magic numbers and file system attributes. Each of these heuristic solutions has a weakness that prevents it from being universally used. The DDH is embedded in the same data stream as the data content so that it will not get separated from it under normal circumstances.
The DDH is designed to be as unobtrusive as possible, and yet extensible to handle more complex problems. Here are a few examples of gradually increasing complexity.
Declaring a stream to be a Java program:
<?application/java?>
Declaring a stream to be a Python 2.3 program:
<?application/python version="2.3"?>
Declaring a stream to be a Shift-JIS encoded Perl 6 program:
<?application/perl version="6" encoding="Shift-JIS"?>
Declaring a stream to be an RTF document, including a pointer to documentation about the RTF specification:
<?text/rtf version="1.5" encoding="ASCII" DocURI="http://www.biblioscape.com/rtf15_spec.htm"?>
Declaring a stream to be a (binary) Zip file:
<?application/zip version="1.0" encoding="ASCII" dataEncoding="binary" DocURI="http://www.pkware.com/products/enterprise/white_papers/appnote.html"?>
Adding some metadata to the Zip file:
<?application/zip version="1.0" encoding="ASCII" dataEncoding="binary" DocURI="http://www.pkware.com/products/enterprise/white_papers/appnote.html"?>
<xml:meta
xmlns:dc="http://purl.org/dc/elements/1.0/"
xmlns:zip="http://www.pkware.com/zip">
<dc:Identifier id="uid">261F1B8C-6C81-11D3-8BFC-0050E4009B3F</dc:Identifier>
<dc:Title>Open eBook Publication Structure 1.0.1</dc:Title>
<dc:Creator role="author">Paul Prescod</dc:Creator>
<zip:manifest> ...</zip:manifest>
</xml:meta>
Embedding a header in a pre-existing file format that has no explicit support for the DDH:
#!/usr/bin/python
# <?application/python?>
There are three related problems that this specification is designed to solve.
Given a pointer to a file or other bitstream, how can the file's format be determined? The three most popular ways of doing this today are incomplete for a variety of reasons.
File Extensions are a usability nightmare. Users do not understand why a part of the document's "title" has special meaning to the computer. There is no way for them to intuit that it is fine to change the data before the period but not the data after it. Some would-be "power users" think that they can convert the file by changing the extension. Even when the file extension is genuinely useful (as on the command line) there is no way for the system to verify that the data thinks it is the same type that the user does. DDH can be used alongside file extensions to help users detect and recover from errors wherein files are given the wrong extension.
File extensions are also not very granular. They tend to be reused across many versions of a format even if the new versions are not backwards compatible with the old ones.
Unix-style magic numbers are designed to be non-invasive and work out of the box with pre-existing file formats and unchanged software. They do an admirable job within those limitations. DDH is a solution that discards those limitations in order to create a better long-term solution. DDH is therefore a much more difficult solution to deploy. It will clearly be years before most files use DDH, and it may never be the case that every single file does. Luckily, it is quite valuable even if only a minority of files use it.
Magic numbers are primarily used for binary files, not text files. More and more data is in text files these days. But text files are getting more complicated because of Unicode. DDH handles various Unicode encodings.
Even for binary files, the magic number solution has less functionality. First, it does not attempt to solve the metadata problem. Without that, it is impossible for a file to have a URL pointer to the remote plugin that the user might want to use to render it. Second, it uses a model wherein the magic number can be at any place in the file. The system needs to be taught how to look for the differentiating keys in every new file type. In DDH, the header always starts with the first byte of the file.
One could argue that magic numbers are more efficient because they add only a few bytes to the size of a file (if any). We do not feel that this is a significant issue for real-world files living on real-world file systems. For very large files, the header adds a relatively tiny overhead. Files are almost always stored in fixed-size blocks. For very small files, the header will usually fit in the same block as the rest of the data.
Extended file system attributes can be used to keep track of a file's type on a particular computer, but they are inevitably lost when the data is transmitted through a protocol or file system that does not understand them. Unfortunately, this is the case with most file systems, archiving syntax and protocols. This happens because the data is not self-describing. If the data and its type are supposed to always travel together it makes sense to store them together. This greatly lessens the chance that they will be separated.
Second, if the file is a text file in one of the popular text encodings such as UTF-8 or ISO-8859-1, is there a way to communicate this to applications that work at a purely textual level, such as text editors, text search engines and text transformers? This specification allows a text file to declare its encoding so that text-level tools know how to decode it from bytes to Unicode characters. This is a more realistic solution to the problem of recognizing text encodings than mandating a single encoding worldwide or wrapping all text documents in XML tags.
Third, whether or not its file format is known to a particular application, is it possible for that application to read or write descriptive metadata from the file? For instance, it is entirely logical for a user's desktop to know who the author of a file is, or to understand the copyright information about a file, without fully understanding the nature of the data. The current solution for this problem is the extended file system attribute. We have already discussed the problem with keeping the metadata separate from the data. In any case, there is no need to choose between in-file metadata and file system-based metadata. In-file metadata is best when data is intended to travel with the file. An author's name might be in-file metadata. File system-based metadata is best for data that only makes sense within the scope of a single system, such as the author's local username.
That said, there are performance reasons why it might be preferable, when writing large files, to use a separate block for metadata. Perhaps future operating systems could magically keep metadata in its own separate block and yet present a single stream interface to applications for reading and transmitting over the network.
Because extended attributes are not widely supported and not sufficiently reliable when they are supported, most file formats have invented various ad hoc ways to handle metadata. Adobe has also proposed a generic solution to this problem called XMP. This proposal does not compete with XMP. In fact, it makes XMP metadata easier to find in the file.
A stream is a sequence of bytes starting with a header defined by this specification.
| [1] | document | ::= | (header|extendedHeader) separator body |
A header is a stream of bytes in some Unicode encoding (including historical national encodings such as ASCII, Shift-JIS, etc.). The algorithm for auto-detecting the encoding is the same as that for XML. Just as with XML, the string "<?" is a reliable key for lookup in an encodings table.
The following production (and all those referenced by it) refers to the post-decoding character sequence.
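The auto-detection step can be illustrated with a minimal Python sketch. It is not part of this specification: the function name `sniff_encoding` and the `SIGNATURES` table are hypothetical, the byte patterns follow the auto-detection appendix of the XML specification, and only a base header that begins directly with "<?" (or a byte order mark) is considered. The result is a provisional codec family; a real processor would go on to read the encoding declaration to refine it.

```python
from typing import Optional

# Byte signatures for a leading "<?" (or a byte order mark) in a few
# encoding families, modelled on the XML auto-detection table.
# Hypothetical table for illustration; the codec names are Python's.
SIGNATURES = [
    (b"\xef\xbb\xbf", "utf-8-sig"),      # UTF-8 with byte order mark
    (b"\xfe\xff", "utf-16"),             # UTF-16 big-endian BOM
    (b"\xff\xfe", "utf-16"),             # UTF-16 little-endian BOM
    (b"\x00\x3c\x00\x3f", "utf-16-be"),  # "<?" in UTF-16BE, no BOM
    (b"\x3c\x00\x3f\x00", "utf-16-le"),  # "<?" in UTF-16LE, no BOM
    (b"\x4c\x6f", "cp037"),              # "<?" in an EBCDIC code page
    (b"\x3c\x3f", "utf-8"),              # "<?" in any ASCII-compatible encoding
]


def sniff_encoding(prefix: bytes) -> Optional[str]:
    """Guess a provisional codec from the first bytes of a stream.

    Returns None when the stream does not start with a recognizable
    "<?" key, i.e. when no base header is present.
    """
    for signature, codec in SIGNATURES:
        if prefix.startswith(signature):
            return codec
    return None
```

For an ASCII-compatible result such as "utf-8", the declaration can then be decoded far enough to read its encoding pseudo-attribute, exactly as an XML parser would.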
| [2] | header | ::= | typeDeclaration metadata? |
The type declaration states the media type, the encoding of the header and, if different, the data encoding of the body.
| [3] | typeDeclaration | ::= | '<?' typeID? versionInfo? encodingDecl? docURI? dataEncodingDecl? '?>' |
The type identifier declares the overall format of the data stream. For example, it could be a Python module or a Word document. It can be a straightforward media type (as defined in HTTP and MIME) or an identifying URI.
| [4] | typeID | ::= | mediaType | TypeURI |
The TypeURI is a type identifier in URI rather than MIME syntax.
Ideally, it can be dereferenced to return information that could be both human and machine readable. Two media types with different TypeURIs are presumed to be different for the purposes of this specification (just as if they were declared with two distinct MIME types).
Note:
The editor hopes that there will arise an RDF schema that can express proper subtype/supertype relationships between formats so that applications can recognize when a document is "similar enough" to what they understand to be processable.
The DocURI is a pointer to human or machine readable documentation about the data format. It is distinguished from the TypeURI in that it is not considered an identifier. You could point to one URI for information about the ZIP file format and I could point to another.
VersionInfo is any string that meets the XML production of the same name. Its meaning is assigned by the description of the MIME type and not constrained by this specification.
The Encoding declaration is as defined in XML. Just as in XML, it is optional if the data is UTF-8 or UTF-16 or in the rare case that the encoding can be reliably inferred from the stream's context.
The DataEncodingDecl is a pseudo-attribute named "dataEncoding". It defines the Unicode encoding not for the header but for the Body. The value "binary" is used to indicate that no Unicode decoding should be attempted for the Body. If the DataEncodingDecl is omitted, it defaults to the same encoding as the header (which may have been inferred to be UTF-8 or UTF-16).
| [5] | dataEncodingDecl | ::= | 'dataEncoding' S? '=' S? (encoding | "binary") |
The metadata part of the header (if provided) should be a well-formed XML document (with or without an XML declaration). The root element must have the element type "xml:meta". Each sub-element or root-level attribute must be in a namespace. Processors (including editors) should ignore elements or attributes in namespaces they are not programmed to recognize. They SHOULD NOT remove unknown elements or attributes in namespaces they do not understand. If the XML declaration has an encoding declaration, it MUST NOT differ from that of the header as a whole.
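As an illustration of productions [3] to [5], the following Python sketch reads a type declaration from already-decoded text. It is a sketch under stated assumptions rather than a conforming parser: the names `parse_declaration` and `body_encoding` are hypothetical, the type identifier is assumed to be present (as it is in all of the examples in the Overview), and pseudo-attributes are accepted in any order.

```python
import re

# '<?' typeID pseudo-attributes '?>' ; the typeID is assumed present here.
DECL_RE = re.compile(r"<\?(?P<type>[^\s?\"'=]+)(?P<attrs>.*?)\?>", re.S)
# name="value" or name='value' pseudo-attributes.
ATTR_RE = re.compile(r"(\w+)\s*=\s*([\"'])(.*?)\2", re.S)


def parse_declaration(text):
    """Return a dict with the type identifier and pseudo-attributes,
    or None if the text does not begin with a type declaration."""
    match = DECL_RE.match(text)
    if match is None:
        return None
    result = {"type": match.group("type")}
    for name, _quote, value in ATTR_RE.findall(match.group("attrs")):
        result[name] = value
    return result


def body_encoding(decl, header_encoding):
    """Apply the defaulting rule: when dataEncoding is omitted, the body
    uses the same encoding as the header."""
    return decl.get("dataEncoding", header_encoding)
```

Applied to the Perl example from the Overview, `parse_declaration` yields the media type together with the `version` and `encoding` values; for the Zip example, `body_encoding` returns "binary", signalling that no text decoding should be attempted on the body.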
If the Body is in a different encoding than the header (typically binary) then the separator must be the character sequence FF, SUB, EOT (aka "^L^Z^D" aka "FORM FEED", "SUBSTITUTE", "END OF TRANSMISSION") which should serve to visually separate the text from the binary data in the terminal programs of most computers.
Otherwise, the separator is considered to be empty and the body begins with the first character in the stream that could not be interpreted as part of the XML document (i.e. is outside of markup and is neither whitespace nor "<"). This necessarily precludes the body from starting with "<" or whitespace. New formats MUST be defined accordingly.
Note:
This requirement is not burdensome. At worst, a format could define its own separator outside of the scope of this specification and discard the separator before processing the body.
| [6] | separator | ::= | (#x0C #x1A #x04)? |
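A minimal Python sketch of splitting a stream at the explicit separator follows. The byte values are taken from the control-character names given in the prose (FORM FEED, SUBSTITUTE, END OF TRANSMISSION), and the sketch assumes an ASCII-compatible header encoding so that the separator occupies exactly three bytes. The name `split_stream` is hypothetical.

```python
# FF, SUB, EOT as single bytes; assumes an ASCII-compatible header encoding.
SEPARATOR = b"\x0c\x1a\x04"


def split_stream(data: bytes):
    """Return (header_bytes, body_bytes) if the explicit separator is
    present, or None when the separator is empty (body in the same
    encoding as the header)."""
    head, sep, body = data.partition(SEPARATOR)
    if not sep:
        return None
    return head, body
```

Because `partition` stops at the first occurrence, any later appearance of the same three bytes inside a binary body is left untouched.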
This specification defines a mechanism called the "extended header". It is designed to support pre-existing uses for the first lines of files. This specification does not change the definition of any pre-existing media types. They should be interpreted as per their various specifications. For example, most Unix systems will not support UTF-16 shell scripts even though this specification might allow such a declaration.
The specification does, however, allow the addition of metadata to those media types for software applications that understand this specification.
It is anticipated that specifications for new formats will make normative references to this one so that this mechanism can replace the various ad hoc mechanisms for self-description and inline metadata. The extended header is merely an interim solution for older formats.
The extended header defines syntactic variations of the base header that are allowed for file formats designed before DDH (for instance Unix shell scripts). There are many such file formats and implementing all of them would take a substantial effort. Therefore, a few common conventions are supported, but not all. Furthermore, formats using the extended headers are restricted to being supersets of UTF-8 and UTF-16.
| [7] | extendedHeader | ::= | shebangLine? CCommentStart? header CCommentEnd? |
| [8] | shebangLine | ::= | '#!' [^#xA]* #xA |
| [9] | CCommentStart | ::= | '/*' S? |
| [10] | CCommentEnd | ::= | S? '*/' |
In an extended header, any line may begin with a shellComment or CPlusComment. If so, the characters matching the leading comment are ignored and the data is treated as if they did not exist.
| [11] | shellComment | ::= | S? ('#' S?)+ |
| [12] | CPlusComment | ::= | S? ("//" S?)+ |
Note that these comments are treated as discardable by the legacy programming language or processor, not by the DDH processor or generator.
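The way a processor might skip the legacy prefix of an extended header can be sketched in Python as follows. This is an illustration only: the name `locate_header` is hypothetical, the input is assumed to be already-decoded text, and only the first line after an optional shebang line is examined, so a header whose metadata spans several commented lines is not handled.

```python
import re

SHEBANG_RE = re.compile(r"\A#![^\n]*\n")                       # production [8]
C_COMMENT_START_RE = re.compile(r"\A/\*\s*")                   # production [9]
LINE_COMMENT_RE = re.compile(r"\A[ \t]*(?:(?:#|//)[ \t]*)+")   # productions [11], [12]


def locate_header(text):
    """Return the text starting at '<?' after skipping an optional
    shebang line and one leading comment opener, or None if no
    header is found there."""
    rest = SHEBANG_RE.sub("", text, count=1)
    rest = C_COMMENT_START_RE.sub("", rest, count=1)
    rest = LINE_COMMENT_RE.sub("", rest, count=1)
    return rest if rest.startswith("<?") else None
```

For a Python script whose first line is a shebang and whose second line is a "#" comment carrying the header, the function returns the text beginning at the "<?" of that second line, which can then be handled like a base header.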