{1 Wink Container File Formats} {2 Design goals} When this text is written, we look for a new, improved way to store html content and digests. The old way stored these data in a hierarchy of 64K directories, and the content/digest for every document had its own file (or even several files). In a performance test this storage format turned out to be a big problem. In particular, we couldn't control in detail how the page cache was utilized. It should be noted that the files were stored on the same nodes as the searcher indexes, and that the searcher performance quickly decreased when we started to write files. After lots of experiments, it became obvious that this had to do with the way meta data was handled in the page cache. Design goals: {ul {- Be page cache friendly! When files are written, it is very unlikely that they are needed in the near future. Ideally, the page cache shouldn't be touched at all. This rules out all storage formats that base on directories, and use separate files for each document. The doc corpus should rather be stored in only a few files, so we can use the fadvise system call to control the page cache. (That's basically the lesson we learned from our performance test.)} {- Be simple and safe! It should be avoided that the storage format can be globally corrupted. It is ok when single files have bad contents after crashes.} {- Avoid disk fragmentation! In the current format, it is an expensive operation to move data from one node to another, because the file tree is very fragmented. By storing in contiguous blocks, the data can be efficiently read and written as a whole.} {- Fast lookups! For digests, we need fast lookups. As digests are quite small in the average, we do not need support for mmap.} {- Full support for filesystem operations. In particular, we need the ability to read and writes only parts of files (and not the whole file), and to append to files (which is a special requirement of the peoplesearcher - a document is not only one download, but often several, and they need to be concatenated somehow). This rules out all dbm implementations - they don't support these operations efficiently.} {- The filesys format should be built on top of more basic formats that can be reused outside of this context. In particular, we have our map/reduce implementation in mind when we think about reuse.} } Compromises we accept: - It is ok that space is not automatically reclaimed when it is no longer used. We accept special compaction procedures that need to be run from time to time. {1 Basic Formats} Three file formats are suggested: - kvseq: stores sequences of keys and values of arbitrary size - hindex: is a hash table of pointers to kvseq entries - perm: is a sequence of pointers to kvseq entries All three formats have a common superblock at the beginning. This superblock contains settings for the file format, and user-defined variables. All integers are stored in network byte order. {2 Superblock Format} The superblock has size SBSIZE bytes. SBSIZE can be varied, but once the superblock is created for the file, SBSIZE is fixed. - The first eight bytes are the magic "#!WINKME" - Then an array of [string[8] * int64] pairs follows. These pairs are the variables that are stored in the superblock. Every variable has a name that is an 8-byte string (padded with spaces). The value is a 64 bit integer. - Finally, eight zero bytes end the superblock. The very first variable must be SBSIZE. This means the size of the superblock is always stored in bytes 16..23. The following variables are used by all file formats: - SBSIZE: See above. It is always the first variable. - FORMAT: Says which format is used. It is always the second variable. - PURPOSE: Indicates for which purpose the file is used. The int64 number is interpreted as 8 characters. PURPOSE is always the third variable. PURPOSE is a user-defined string. - FILESIZE: The variable contains the position of the logical end of the file. The file can actually be longer, but the contents are only interpreted up to FILESIZE-1. This allows to allocate file blocks before they are used in order to avoid fragmentation. - FILEINCR: Suggested increment of the real file size if more space is needed. - SYNCSIZE: The FILESIZE at the time of the last sync (optional). - SYNCTIME: The time of the last sync (in seconds since the epoch, optional) {2 kvseq format} The variable FORMAT is 0x10. The list of keys and values starts after the superblock and ends at FILESIZE-1. Logically, this is a list of (delflag, key, value) or (key,value) entries where - delflag marks deleted entries (exists only if KVDELFL is 1) - key is the key string - value is the value string The list is stored by concatenating the representations of the entries. Every entry is represented by concatenating the representations of the delflag (if existing), the key, and the value. The representation of the delflag is always 1 byte that can be 0 or 1. 0 means not deleted, 1 means deleted. The delflag is omitted if the variable KVDELFL does not exist or is not equal to 1. The representation of key and value depend on superblock variables. If KEYREPR is < 4, keys have variable length. The length precedes the key: - KEYREPR=0: The length is stored as int8 - KEYREPR=1: The length is stored as int16 - KEYREPR=2: The length is stored as int32 - KEYREPR=3: The length is stored as int64 If KEYREPR is >= 4 and <= 259, keys have the fixed length KEYREPR-4 bytes. If KEYREPR is >= 260 and <= 514, keys have a variable length of 0 to KEYREPR-259 bytes, but a fixed amount of KEYREPR-259 bytes are allocated for the key string. One byte containing the key length precedes the key, followed by the key, followed by P null bytes padding the key, P = KEYREPR-259-N when N is the length of the key. KEYREPR > 514 is currently undefined. The variable VALREPR determines the representation of the values in the same way. The file format also supports alignment. If the variable ALIGN exists and is > 0, entries start at multiples of ALIGN. Further variables that are updated after every modification (but only if they already exist in the superblock): - ENTRIES: The total number of entries, including deleted entries - AENTRIES: The number of non-deleted entries The superblock may also contain HTALGO as defined for hindex. This is just a hint for the best hash algorithm for this data. Operations: Entries are identified by a file pointer (kv_pointer) which is simply the 64 bit offset of the entry. - append: append a new entry to the end - replace: replace an entry. The new entry must be representable in the file. This means: unless it is the last entry, the new entry must have the same size. - delete: mark an entry as deleted - iter: iterate over the entries in file order {2 hindex format} The variable FORMAT is 0x20. The hash table is stored after the superblock, and is simply an array of HTSIZE integers. Each integer has a bit size which is a multiple of 64 bits. The first 64 bits of these integers are usually interpreted as kv_pointers. The remaining bits, if any, can be used to store additional data per cell. The superblock defines constants: - CELLSZ: How long every cell is, in multiples of 64 bits. The value must be >= 1, i.e. the minimum cell length is 64 bits. If CELLSZ is missing, a cell size of 64 bits is assumed. The are two special constants: - HTFREE: This value is stored in hash table cells to denote free cells. - HTDEL: This value is stored in hash table cells to denote deleted cells. The hash table uses linear probing to resolve conflicts. The variable HTALGO defines the algorithm to compute the hash index from the key: - Algorithm 1: The MD5 of the key is taken, and the last 63 bits are taken modulo HTSIZE. (Further algorithms to be defined.) The superblocks contains also these values which are updated after every hash table operation (if they already exist in the superblock): - ENTRIES: The total number of table cells that are not HTFREE. - AENTRIES: The number of table cells that are neither HTFREE nor HTDEL. Operations: - insert: insert a new entry, return the position in the table - delete: mark an entry as deleted - hash: return the position for a given key - get: return the cell contents for a given position {2 hindex with stored hashes} This format is a special hindex where CELLSZ = 2. For every used cell, the second 64 bits store the first 64 bits of the MD5 of the key. For free and deleted cells, the second 64 bits are zero. {2 perm format} The variable FORMAT is 0x30. The permutation array is stored after the superblock, and is simply an array of PERMSIZE pairs of (int64 * int64) integers. The left integers are usually interpreted as kv_pointers. The right integers are interpreted as hash values of keys. The variable HTALGO defines the hash algorithm (with parameter HTSIZE). Operations: - append: Add a new entry to the end - get: get the pair at index i - swap: swap the pairs at index i,k - group: sort the pairs by hash values. To compare two indentical hash values h1 = h2, one has to pass in a custom comparison function that usually compares the corresponding keys. - sort: sort the pairs by a custom comparison function {2 Magic numbers} - The first eight bytes are always "#!WINKME". - Bytes 32..39 contain the FORMAT variable in network byte order: {[ 00 00 00 00 00 00 00 10 kvseq 00 00 00 00 00 00 00 20 hindex 00 00 00 00 00 00 00 30 perm ]} - Bytes 48..55 contain the PURPOSE variable which is a string of eight characters. Tests for /etc/magic: {[ 0 string #!WINKME Wink peoplesearcher file, >32 string \0\0\0\0\0\0\0\x10 kvseq format, >>48 byte >0 %c >>49 byte >0 \b%c >>50 byte >0 \b%c >>51 byte >0 \b%c >>52 byte >0 \b%c >>53 byte >0 \b%c >>54 byte >0 \b%c >>55 byte >0 \b%c >32 string \0\0\0\0\0\0\0\x20 hindex format >>48 byte >0 %c >>49 byte >0 \b%c >>50 byte >0 \b%c >>51 byte >0 \b%c >>52 byte >0 \b%c >>53 byte >0 \b%c >>54 byte >0 \b%c >>55 byte >0 \b%c >32 string \0\0\0\0\0\0\0\x30 perm format >>48 byte >0 %c >>49 byte >0 \b%c >>50 byte >0 \b%c >>51 byte >0 \b%c >>52 byte >0 \b%c >>53 byte >0 \b%c >>54 byte >0 \b%c >>55 byte >0 \b%c ]} {1 Derived Formats} The following combinations are immediately possible: - kvseq with hindex: This is a kvseq indexed by keys - kvseq with perm: This is a kvseq in a permuted order {2 filesys format} A filesystem stores files. It is basically a mapping file_name -> contents where file_name is a simple string (no subdirectories). We need to support these operations: - create: Create new file - append: Append to existing file - delete: Delete file - rename: Rename a file - stat: Get information about file - read: Read (parts of) a file - iter: read file by file Filesys are represented by two files: {ul {- datafile: This is a kvseq with two kinds of entries: {ul {- An inode entry has the key "/I" where is a number enumerating all inode entries for a given file, from 0 to NI-1. The value of the inode entries has the format described below. Essentially, the inode entries contain a list of file pointers to the data entries of the file.} {- A data entry has a key "/D" where is a number enumerating all data entries for a given file, from 0 to ND-1, and is a hex number identifying the inode (also stored in the inode). The concatenation of the values of the data entries gives the data part of the file. The bytes 0 to LSIZE-1 of the data part are the contents of the file. Exceeding bytes remain uninterpreted.} } } {- indexfile: This is an hindex for the datafile. Only the keys "/I0", i.e. the first inode entries, are indexed. } } A deleted file is marked as deleted in both datafile and indexfile. The inode stores a few properties about the file, and kv_pointers to the data entries. For the moment, we always have NI=1, i.e. only one inode per file. If the inode becomes too small, the old inode is deleted, and a new inode is allocated (that is twice as big). The inode contains: - CKSUM: Checksum of all following fields (except the padding at the end) - NEXTI: Pointer to the next inode (currently always 0 because NI=1) - FILEID: The file ID. This is a number that identifies the inode uniquely in the datafile. The file ID remains the same when the file is renamed. - LSIZE: logical file size in bytes (int64) - FTYPE: file type (int64), but only values 0-255 - FMTIME: file mtime (int64) - DCOUNT: the number of data entries the file consists of - an array of pairs (kv_pointer, data_size) pointing to the data strings (both kv_pointer and data_size are represented as int64) - the rest of the inode is padded with 0 bytes The superblock of the datafile contains a few additional variables: - ISZ: The default size of new inodes - ITOTSZ: The total size of the valid inodes - DTOTSZ: The total size of all files - HAVEDUPS: 0 or 1. If 1, it is allowed to use the same file name several times. However, only the version of the file counts that occurs later in the file (=at a later position), and only this version is in the index. (This option has performance advantages when CELLSZ=2 index files are used.) The superblock of both files sets PURPOSE to: - "FSYSDATA" for the datafile - "FSYSIDX" for the indexfile Operations for filesys: - create a file: create the initial inode, and append it to the inodefile. No data strings are allocated. Add an entry to the indexfile. - delete a file: look up the inode using the index. Iterate over all data strings, and mark them as deleted. Mark the inode as deleted. Mark the index entry as deleted. - append to file: look up the inode using the index. Add a data string to the data file. If the pointer to the data string fits into the inode, add it, and we are done. Otherwise, resize the inode, and add the pointer then. - rename a file: one has to allocate a new inode for the new name, and also rename in the index - iter: Iterate over the datafile, and skip (a) deleted files, (b) data entries. - reindex: It is possible to drop the indexfile, and build a new index only from the datafile {2 filesys variant "append only" (ao)} This variant only allows appending to the data file. Special provisions are taken so the filesys can be easily iterated in disk order, without exposing file positions: - The mtime field of files is immutable in the ao variant, and so the mtime reflects the time the file was created. - As a consequence, the files are always stored in ascending mtime order. Because of this, the mtime is used as secondary key of the files, so one can request: Iterate over all files whose mtime is between a certain range. - A third file is written, the time index (.time). This is in fact a kvseq file mapping mtime to file positions. However, not every mtime appears in this file that occurs as mtime of a file, but only selected mtimes. These are called time marks. The time marks are written periodically, i.e. every TIMEMP seconds. The superblock of the datafile contains additional fields: - HAVEAO: This is 1 to indicate that the ao variant is in effect The timefile is a kvseq file with these superblock fields: - PURPOSE: "FSYSTIME" - TIMEMP: every how many seconds a new time mark is written. This is done when the first file is appended for which mtime mod TIMEMP = 0. - TIMEMARK: The last written time mark - TIMEMN: how many files have been appended since the last time mark. - LASTMT: the mtime of the last file The kvseq entries: - The keys are the time marks as 64 bit integers (big endian, as usual) - The values are the file positions as 64 bit intergers (big endian too) Some further notes about this format: - The syncing of the timefile should happen at the same time as syncing of the datafile - The inodes cannot be resized! The initial inode size (ISZ) should be large enough to hold the meta data of every possible file.