Seqdb_formats
Wink Container File Formats
Design goals
When this text is written, we look for a new, improved way to store
html content and digests. The old way stored these data in a hierarchy
of 64K directories, and the content/digest for every document had its
own file (or even several files). In a performance test this storage
format turned out to be a big problem. In particular, we couldn't
control in detail how the page cache was utilized. It should be noted
that the files were stored on the same nodes as the searcher indexes,
and that the searcher performance quickly decreased when we started to
write files. After lots of experiments, it became obvious that this
had to do with the way meta data was handled in the page cache.
Design goals:
- Be page cache friendly! When files are written, it is very unlikely
that they are needed in the near future. Ideally, the page cache
shouldn't be touched at all.
This rules out all storage formats that base on directories, and use
separate files for each document. The doc corpus should rather be
stored in only a few files, so we can use the fadvise system call to
control the page cache. (That's basically the lesson we learned
from our performance test.)
- Be simple and safe! It should be avoided that the storage format
can be globally corrupted. It is ok when single files have bad
contents after crashes.
- Avoid disk fragmentation! In the current format, it is an expensive
operation to move data from one node to another, because the file
tree is very fragmented. By storing in contiguous blocks, the data
can be efficiently read and written as a whole.
- Fast lookups! For digests, we need fast lookups. As digests are
quite small in the average, we do not need support for mmap.
- Full support for filesystem operations. In particular, we need the
ability to read and writes only parts of files (and not the whole
file), and to append to files (which is a special requirement
of the peoplesearcher - a document is not only one download,
but often several, and they need to be concatenated somehow).
This rules out all dbm implementations - they don't support these
operations efficiently.
- The filesys format should be built on top of more basic formats that
can be reused outside of this context. In particular, we have our
map/reduce implementation in mind when we think about reuse.
Compromises we accept:
- It is ok that space is not automatically reclaimed when it is
no longer used. We accept special compaction procedures that need
to be run from time to time.
Basic Formats
Three file formats are suggested:
- kvseq: stores sequences of keys and values of arbitrary size
- hindex: is a hash table of pointers to kvseq entries
- perm: is a sequence of pointers to kvseq entries
All three formats have a common superblock at the beginning. This
superblock contains settings for the file format, and user-defined
variables.
All integers are stored in network byte order.
Superblock Format
The superblock has size SBSIZE bytes. SBSIZE can be varied, but once
the superblock is created for the file, SBSIZE is fixed.
- The first eight bytes are the magic "#!WINKME"
- Then an array of
string[8] * int64 pairs follows. These pairs
are the variables that are stored in the superblock. Every variable
has a name that is an 8-byte string (padded with spaces). The value
is a 64 bit integer.
- Finally, eight zero bytes end the superblock.
The very first variable must be SBSIZE. This means the size of the
superblock is always stored in bytes 16..23.
The following variables are used by all file formats:
- SBSIZE: See above. It is always the first variable.
- FORMAT: Says which format is used. It is always the second
variable.
- PURPOSE: Indicates for which purpose the file is used.
The int64 number is interpreted as 8 characters.
PURPOSE is always the third variable. PURPOSE is a user-defined
string.
- FILESIZE: The variable contains the position of the logical
end of the file. The file can actually be longer, but the
contents are only interpreted up to FILESIZE-1. This allows
to allocate file blocks before they are used in order to
avoid fragmentation.
- FILEINCR: Suggested increment of the real file size if more
space is needed.
- SYNCSIZE: The FILESIZE at the time of the last sync (optional).
- SYNCTIME: The time of the last sync (in seconds since the epoch,
optional)
kvseq format
The variable FORMAT is 0x10.
The list of keys and values starts after the superblock and
ends at FILESIZE-1. Logically, this is a list of
(delflag, key, value) or (key,value) entries where
- delflag marks deleted entries (exists only if KVDELFL is 1)
- key is the key string
- value is the value string
The list is stored by concatenating the representations of the
entries. Every entry is represented by concatenating the
representations of the delflag (if existing), the key, and the value.
The representation of the delflag is always 1 byte that can be 0 or
1. 0 means not deleted, 1 means deleted. The delflag is omitted if the
variable KVDELFL does not exist or is not equal to 1.
The representation of key and value depend on superblock variables.
If KEYREPR is < 4, keys have variable length. The length precedes
the key:
- KEYREPR=0: The length is stored as int8
- KEYREPR=1: The length is stored as int16
- KEYREPR=2: The length is stored as int32
- KEYREPR=3: The length is stored as int64
If KEYREPR is >= 4 and <= 259, keys have the fixed length KEYREPR-4
bytes.
If KEYREPR is >= 260 and <= 514, keys have a variable length of 0 to
KEYREPR-259 bytes, but a fixed amount of KEYREPR-259 bytes are
allocated for the key string. One byte containing the key length
precedes the key, followed by the key, followed by P null bytes
padding the key, P = KEYREPR-259-N when N is the length of the key.
KEYREPR > 514 is currently undefined.
The variable VALREPR determines the representation of the values
in the same way.
The file format also supports alignment. If the variable ALIGN
exists and is > 0, entries start at multiples of ALIGN.
Further variables that are updated after every modification
(but only if they already exist in the superblock):
- ENTRIES: The total number of entries, including deleted entries
- AENTRIES: The number of non-deleted entries
The superblock may also contain HTALGO as defined for hindex. This is
just a hint for the best hash algorithm for this data.
Operations: Entries are identified by a file pointer (kv_pointer)
which is simply the 64 bit offset of the entry.
- append: append a new entry to the end
- replace: replace an entry. The new entry must be representable in
the file. This means: unless it is the last entry, the new entry
must have the same size.
- delete: mark an entry as deleted
- iter: iterate over the entries in file order
hindex format
The variable FORMAT is 0x20.
The hash table is stored after the superblock, and is simply an array
of HTSIZE integers. Each integer has a bit size which is a multiple
of 64 bits. The first 64 bits of these integers are usually interpreted as
kv_pointers. The remaining bits, if any, can be used to store additional
data per cell.
The superblock defines constants:
- CELLSZ: How long every cell is, in multiples of 64 bits. The
value must be >= 1, i.e. the minimum cell length is 64 bits.
If CELLSZ is missing, a cell size of 64 bits is assumed.
The are two special constants:
- HTFREE: This value is stored in hash table cells to denote
free cells.
- HTDEL: This value is stored in hash table cells to denote
deleted cells.
The hash table uses linear probing to resolve conflicts.
The variable HTALGO defines the algorithm to compute the hash index
from the key:
- Algorithm 1: The MD5 of the key is taken, and the last 63 bits
are taken modulo HTSIZE.
(Further algorithms to be defined.)
The superblocks contains also these values which are updated after
every hash table operation (if they already exist in the superblock):
- ENTRIES: The total number of table cells that are not HTFREE.
- AENTRIES: The number of table cells that are neither HTFREE nor HTDEL.
Operations:
- insert: insert a new entry, return the position in the table
- delete: mark an entry as deleted
- hash: return the position for a given key
- get: return the cell contents for a given position
hindex with stored hashes
This format is a special hindex where CELLSZ = 2. For every used
cell, the second 64 bits store the first 64 bits of the MD5 of the
key. For free and deleted cells, the second 64 bits are zero.
perm format
The variable FORMAT is 0x30.
The permutation array is stored after the superblock, and is simply an
array of PERMSIZE pairs of (int64 * int64) integers. The left integers
are usually interpreted as kv_pointers. The right integers are
interpreted as hash values of keys.
The variable HTALGO defines the hash algorithm (with parameter
HTSIZE).
Operations:
- append: Add a new entry to the end
- get: get the pair at index i
- swap: swap the pairs at index i,k
- group: sort the pairs by hash values. To compare two indentical
hash values h1 = h2, one has to pass in a custom comparison function
that usually compares the corresponding keys.
- sort: sort the pairs by a custom comparison function
Magic numbers
Tests for /etc/magic:
0 string #!WINKME Wink peoplesearcher file,
>32 string \0\0\0\0\0\0\0\x10 kvseq format,
>>48 byte >0 %c
>>49 byte >0 \b%c
>>50 byte >0 \b%c
>>51 byte >0 \b%c
>>52 byte >0 \b%c
>>53 byte >0 \b%c
>>54 byte >0 \b%c
>>55 byte >0 \b%c
>32 string \0\0\0\0\0\0\0\x20 hindex format
>>48 byte >0 %c
>>49 byte >0 \b%c
>>50 byte >0 \b%c
>>51 byte >0 \b%c
>>52 byte >0 \b%c
>>53 byte >0 \b%c
>>54 byte >0 \b%c
>>55 byte >0 \b%c
>32 string \0\0\0\0\0\0\0\x30 perm format
>>48 byte >0 %c
>>49 byte >0 \b%c
>>50 byte >0 \b%c
>>51 byte >0 \b%c
>>52 byte >0 \b%c
>>53 byte >0 \b%c
>>54 byte >0 \b%c
>>55 byte >0 \b%c
Derived Formats
The following combinations are immediately possible:
- kvseq with hindex: This is a kvseq indexed by keys
- kvseq with perm: This is a kvseq in a permuted order
filesys format
A filesystem stores files. It is basically a mapping
file_name -> contents
where file_name is a simple string (no subdirectories). We need to
support these operations:
- create: Create new file
- append: Append to existing file
- delete: Delete file
- rename: Rename a file
- stat: Get information about file
- read: Read (parts of) a file
- iter: read file by file
Filesys are represented by two files:
- datafile: This is a kvseq with two kinds of entries:
- An inode entry has the key "<file_name>/I<n>" where <n> is a number
enumerating all inode entries for a given file, from 0 to NI-1.
The value of the inode entries has the format described below.
Essentially, the inode entries contain a list of file pointers
to the data entries of the file.
- A data entry has a key "<file_id>/D<n>" where <n> is a number
enumerating all data entries for a given file, from 0 to ND-1,
and <file_id> is a hex number identifying the inode (also
stored in the inode).
The concatenation of the values of the data entries gives the
data part of the file. The bytes 0 to LSIZE-1 of the data part
are the contents of the file. Exceeding bytes remain uninterpreted.
- indexfile: This is an hindex for the datafile. Only the
keys "<file_name>/I0", i.e. the first inode entries, are indexed.
A deleted file is marked as deleted in both datafile and indexfile.
The inode stores a few properties about the file, and kv_pointers to the
data entries.
For the moment, we always have NI=1, i.e. only one inode per file. If
the inode becomes too small, the old inode is deleted, and a new inode
is allocated (that is twice as big).
The inode contains:
- CKSUM: Checksum of all following fields (except the padding at the
end)
- NEXTI: Pointer to the next inode (currently always 0 because NI=1)
- FILEID: The file ID. This is a number that identifies the inode
uniquely in the datafile. The file ID remains the same when the
file is renamed.
- LSIZE: logical file size in bytes (int64)
- FTYPE: file type (int64), but only values 0-255
- FMTIME: file mtime (int64)
- DCOUNT: the number of data entries the file consists of
- an array of pairs (kv_pointer, data_size) pointing to the data strings
(both kv_pointer and data_size are represented as int64)
- the rest of the inode is padded with 0 bytes
The superblock of the datafile contains a few additional variables:
- ISZ: The default size of new inodes
- ITOTSZ: The total size of the valid inodes
- DTOTSZ: The total size of all files
- HAVEDUPS: 0 or 1. If 1, it is allowed to use the same file name
several times. However, only the version of the file counts that
occurs later in the file (=at a later position), and only this
version is in the index. (This option has performance advantages
when CELLSZ=2 index files are used.)
The superblock of both files sets PURPOSE to:
- "FSYSDATA" for the datafile
- "FSYSIDX" for the indexfile
Operations for filesys:
- create a file: create the initial inode, and append it to the inodefile.
No data strings are allocated. Add an entry to the indexfile.
- delete a file: look up the inode using the index. Iterate over all
data strings, and mark them as deleted. Mark the inode as deleted.
Mark the index entry as deleted.
- append to file: look up the inode using the index. Add a data string
to the data file. If the pointer to the data string fits into the
inode, add it, and we are done. Otherwise, resize the inode, and
add the pointer then.
- rename a file: one has to allocate a new inode for the new name,
and also rename in the index
- iter: Iterate over the datafile, and skip (a) deleted files, (b)
data entries.
- reindex: It is possible to drop the indexfile, and build a new
index only from the datafile
filesys variant "append only" (ao)
This variant only allows appending to the data file. Special provisions
are taken so the filesys can be easily iterated in disk order, without
exposing file positions:
- The mtime field of files is immutable in the ao variant, and so the
mtime reflects the time the file was created.
- As a consequence, the files are always stored in ascending mtime
order. Because of this, the mtime is used as secondary key of the
files, so one can request: Iterate over all files whose mtime is
between a certain range.
- A third file is written, the time index (.time). This is in fact a
kvseq file mapping mtime to file positions. However, not every mtime
appears in this file that occurs as mtime of a file, but only
selected mtimes. These are called time marks. The time marks are
written periodically, i.e. every TIMEMP seconds.
The superblock of the datafile contains additional fields:
- HAVEAO: This is 1 to indicate that the ao variant is in effect
The timefile is a kvseq file with these superblock fields:
- PURPOSE: "FSYSTIME"
- TIMEMP: every how many seconds a new time mark is written. This is done
when the first file is appended for which mtime mod TIMEMP = 0.
- TIMEMARK: The last written time mark
- TIMEMN: how many files have been appended since the last time mark.
- LASTMT: the mtime of the last file
The kvseq entries:
- The keys are the time marks as 64 bit integers (big endian, as usual)
- The values are the file positions as 64 bit intergers (big endian too)
Some further notes about this format:
- The syncing of the timefile should happen at the same time as
syncing of the datafile
- The inodes cannot be resized! The initial inode size (ISZ) should
be large enough to hold the meta data of every possible file.