Seqdb_formats



Wink Container File Formats

Design goals

When this text is written, we look for a new, improved way to store html content and digests. The old way stored these data in a hierarchy of 64K directories, and the content/digest for every document had its own file (or even several files). In a performance test this storage format turned out to be a big problem. In particular, we couldn't control in detail how the page cache was utilized. It should be noted that the files were stored on the same nodes as the searcher indexes, and that the searcher performance quickly decreased when we started to write files. After lots of experiments, it became obvious that this had to do with the way meta data was handled in the page cache.

Design goals:

Compromises we accept:

Basic Formats

Three file formats are suggested:

All three formats have a common superblock at the beginning. This superblock contains settings for the file format, and user-defined variables.

All integers are stored in network byte order.

Superblock Format

The superblock has size SBSIZE bytes. SBSIZE can be varied, but once the superblock is created for the file, SBSIZE is fixed.

The very first variable must be SBSIZE. This means the size of the superblock is always stored in bytes 16..23.

The following variables are used by all file formats:

kvseq format

The variable FORMAT is 0x10.

The list of keys and values starts after the superblock and ends at FILESIZE-1. Logically, this is a list of (delflag, key, value) or (key,value) entries where

The list is stored by concatenating the representations of the entries. Every entry is represented by concatenating the representations of the delflag (if existing), the key, and the value.

The representation of the delflag is always 1 byte that can be 0 or 1. 0 means not deleted, 1 means deleted. The delflag is omitted if the variable KVDELFL does not exist or is not equal to 1.

The representation of key and value depend on superblock variables. If KEYREPR is < 4, keys have variable length. The length precedes the key:

If KEYREPR is >= 4 and <= 259, keys have the fixed length KEYREPR-4 bytes.

If KEYREPR is >= 260 and <= 514, keys have a variable length of 0 to KEYREPR-259 bytes, but a fixed amount of KEYREPR-259 bytes are allocated for the key string. One byte containing the key length precedes the key, followed by the key, followed by P null bytes padding the key, P = KEYREPR-259-N when N is the length of the key.

KEYREPR > 514 is currently undefined.

The variable VALREPR determines the representation of the values in the same way.

The file format also supports alignment. If the variable ALIGN exists and is > 0, entries start at multiples of ALIGN.

Further variables that are updated after every modification (but only if they already exist in the superblock):

The superblock may also contain HTALGO as defined for hindex. This is just a hint for the best hash algorithm for this data.

Operations: Entries are identified by a file pointer (kv_pointer) which is simply the 64 bit offset of the entry.

hindex format

The variable FORMAT is 0x20.

The hash table is stored after the superblock, and is simply an array of HTSIZE integers. Each integer has a bit size which is a multiple of 64 bits. The first 64 bits of these integers are usually interpreted as kv_pointers. The remaining bits, if any, can be used to store additional data per cell.

The superblock defines constants:

The are two special constants:

The hash table uses linear probing to resolve conflicts.

The variable HTALGO defines the algorithm to compute the hash index from the key:

(Further algorithms to be defined.)

The superblocks contains also these values which are updated after every hash table operation (if they already exist in the superblock):

Operations:

hindex with stored hashes

This format is a special hindex where CELLSZ = 2. For every used cell, the second 64 bits store the first 64 bits of the MD5 of the key. For free and deleted cells, the second 64 bits are zero.

perm format

The variable FORMAT is 0x30.

The permutation array is stored after the superblock, and is simply an array of PERMSIZE pairs of (int64 * int64) integers. The left integers are usually interpreted as kv_pointers. The right integers are interpreted as hash values of keys.

The variable HTALGO defines the hash algorithm (with parameter HTSIZE).

Operations:

Magic numbers

Tests for /etc/magic:

0     string		#!WINKME		Wink peoplesearcher file,
>32   string		\0\0\0\0\0\0\0\x10	kvseq format,
>>48  byte		>0			%c
>>49  byte		>0			\b%c
>>50  byte		>0			\b%c
>>51  byte		>0			\b%c
>>52  byte		>0			\b%c
>>53  byte		>0			\b%c
>>54  byte		>0			\b%c
>>55  byte		>0			\b%c
>32   string		\0\0\0\0\0\0\0\x20	hindex format
>>48  byte		>0			%c
>>49  byte		>0			\b%c
>>50  byte		>0			\b%c
>>51  byte		>0			\b%c
>>52  byte		>0			\b%c
>>53  byte		>0			\b%c
>>54  byte		>0			\b%c
>>55  byte		>0			\b%c
>32   string		\0\0\0\0\0\0\0\x30	perm format
>>48  byte		>0			%c
>>49  byte		>0			\b%c
>>50  byte		>0			\b%c
>>51  byte		>0			\b%c
>>52  byte		>0			\b%c
>>53  byte		>0			\b%c
>>54  byte		>0			\b%c
>>55  byte		>0			\b%c

Derived Formats

The following combinations are immediately possible:

filesys format

A filesystem stores files. It is basically a mapping

file_name -> contents

where file_name is a simple string (no subdirectories). We need to support these operations:

Filesys are represented by two files:

A deleted file is marked as deleted in both datafile and indexfile.

The inode stores a few properties about the file, and kv_pointers to the data entries.

For the moment, we always have NI=1, i.e. only one inode per file. If the inode becomes too small, the old inode is deleted, and a new inode is allocated (that is twice as big).

The inode contains:

The superblock of the datafile contains a few additional variables:

The superblock of both files sets PURPOSE to:

Operations for filesys:

filesys variant "append only" (ao)

This variant only allows appending to the data file. Special provisions are taken so the filesys can be easily iterated in disk order, without exposing file positions:

The superblock of the datafile contains additional fields:

The timefile is a kvseq file with these superblock fields:

The kvseq entries:

Some further notes about this format: