Chunks for each cube are arranged on disk in a fixed order of dimensions. Aggregation operations can either access the chunks in this linear storage order, optimizing for I/O [ZDN97], or access them in the order of the dimension being aggregated. In this work we explore both alternatives.
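To illustrate the two access orders, the following minimal sketch (not the implementation used in our system; the chunk-grid extents and the visit_chunk routine are assumptions made for illustration) sweeps chunks stored in row-major linear order either sequentially or with the aggregated dimension varying fastest:

    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // visit_chunk() stands for whatever work the aggregation does on one chunk.
    static void visit_chunk(std::int64_t linear_id) {
        std::printf("chunk %lld\n", static_cast<long long>(linear_id));
    }

    // Linear order: a single sequential sweep over the file, favoring I/O.
    void scan_linear(const std::vector<int>& nchunks) {
        std::int64_t total = 1;
        for (int n : nchunks) total *= n;
        for (std::int64_t c = 0; c < total; ++c) visit_chunk(c);
    }

    // Dimension order: the aggregated dimension varies fastest, so all chunks
    // sharing the remaining coordinates are visited together.
    void scan_along_dim(const std::vector<int>& nchunks, int agg_dim) {
        int ndims = static_cast<int>(nchunks.size());
        std::vector<int> coord(ndims, 0);
        std::vector<std::int64_t> stride(ndims, 1);   // row-major strides
        for (int d = ndims - 2; d >= 0; --d) stride[d] = stride[d + 1] * nchunks[d + 1];
        bool done = false;
        while (!done) {
            // innermost loop runs over the aggregated dimension
            for (coord[agg_dim] = 0; coord[agg_dim] < nchunks[agg_dim]; ++coord[agg_dim]) {
                std::int64_t id = 0;
                for (int d = 0; d < ndims; ++d) id += coord[d] * stride[d];
                visit_chunk(id);
            }
            // odometer-style increment over the remaining dimensions
            done = true;
            for (int d = ndims - 1; d >= 0; --d) {
                if (d == agg_dim) continue;
                if (++coord[d] < nchunks[d]) { done = false; break; }
                coord[d] = 0;
            }
        }
    }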
To support large data sets with a large number of dimensions we store data in chunks. Sparse chunks are stored using BESS, while chunks whose density exceeds a certain threshold are stored as dense chunks. Chunks are written to and read from disk as needed during execution. When a chunk is first referenced, only a small portion of it, called a minichunk, is allocated in memory; when more memory is required, full minichunks are written to disk and their memory is reused. An appropriate minichunk size is chosen according to the density and distribution of the dataset: too large a size leaves minichunks only sparsely filled and allows fewer of them to reside in memory, whereas too small a size triggers frequent disk I/O to write out filled minichunks. We have experimented with several minichunk sizes and use a size of 25 (BESS, value) pairs in the performance figures. For each cube, a file is maintained to store the minichunks of its chunks.
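The sketch below illustrates the minichunk mechanism described above, under the assumption that a minichunk is a fixed-size array of (BESS, value) pairs appended to the cube's minichunk file when it fills up; the structure names, pair layout, and file handling are illustrative and not our system's actual code:

    #include <cstdint>
    #include <cstdio>

    constexpr int MINICHUNK_SIZE = 25;   // pairs per minichunk, as in the experiments

    struct BessPair {
        std::uint64_t bess;   // bit-encoded dimension indices within the chunk
        double        value;  // measure value
    };

    struct Minichunk {
        BessPair pairs[MINICHUNK_SIZE];
        int      count = 0;
    };

    class SparseChunk {
    public:
        explicit SparseChunk(std::FILE* minichunk_file) : file_(minichunk_file) {}

        // Insert one (BESS, value) pair, spilling to disk when the minichunk is full.
        void insert(std::uint64_t bess, double value) {
            if (mc_.count == MINICHUNK_SIZE) spill();
            mc_.pairs[mc_.count++] = BessPair{bess, value};
        }

        // Append the current minichunk to the cube's file and reuse its memory.
        void spill() {
            if (mc_.count == 0) return;
            std::fwrite(mc_.pairs, sizeof(BessPair), mc_.count, file_);
            spilled_ += mc_.count;
            mc_.count = 0;
        }

    private:
        Minichunk   mc_;
        std::FILE*  file_;
        long        spilled_ = 0;   // pairs already written to disk for this chunk
    };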
Also, the chunk sizes have to be chosen carefully to balance the size of the overall chunk structure against the representation of the dimension indices in BESS. This choice also affects the sizes of the files created for each cube, which on a conventional UNIX file system are limited to 2 GB; for larger files, either 64-bit file offsets or a parallel file system is required. Further, the chunk structure for each cube, which stores the meta-data of the chunks, their topology, and their distribution over processors, has to fit in the available memory. If the chunk structure is larger than available memory, which may happen for a large number of dimensions with large dimension sizes, it has to be paged by keeping the information for the currently active chunks in memory and the rest on disk. Paging also benefits the chunk-reuse based calculations, since memory is then allocated only to the parent and child chunks currently being computed.
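To make the trade-off between chunk size and the BESS index representation concrete, the following sketch (the bits_for helper and the 64-bit encoding word are assumptions) computes how many bits one BESS index needs for a given choice of chunk extents; larger chunks require more bits per index but produce fewer chunks and hence a smaller chunk structure:

    #include <cstdint>
    #include <vector>

    // ceil(log2(chunk_extent)): bits needed to index one dimension within a chunk.
    static int bits_for(std::int64_t chunk_extent) {
        int bits = 0;
        while ((std::int64_t{1} << bits) < chunk_extent) ++bits;
        return bits;
    }

    // Total bits one BESS index needs for a chunk with the given extents, so the
    // caller can check that it fits the encoding word (e.g. 64 bits) before
    // settling on the chunk sizes.
    int bess_bits(const std::vector<std::int64_t>& chunk_extents) {
        int total = 0;
        for (std::int64_t e : chunk_extents) total += bits_for(e);
        return total;
    }

For example, a 6-dimensional cube with chunk extents of 16 in every dimension needs 6 x 4 = 24 bits per BESS index; doubling the chunk extent to 32 raises this to 30 bits but reduces the number of chunks, and hence the size of the chunk structure, by a factor of 2^6.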