We are happy to announce that Skyhook Data Management is now officially a part of the Apache Arrow project mainline and is planned to be included in release 7.0.0. SkyhookDM is a plugin for offloading computations involving data processing operations into the storage layer of distributed and programmable object storage systems. It is being developed and maintained by researchers at the Centre for Research in Open Source Software, Baskin School of Engineering, University of California, Santa Cruz. The goal of Skyhook is to reduce client-side resource utilization in terms of CPU, memory bandwidth, and network utilization by offloading data management and processing tasks to the storage layer. We use Ceph, a petabyte-scale distributed object storage system as the storage layer for Skyhook since it provides an excellent object-store extension mechanism with its Object class SDK. On the client side, we use the Arrow Dataset API to expose the functionality. Our implementation is within Ceph but is not Ceph specific, rather it is applicable to any storage system with similar programmability features such as user-defined object classes and partial read/write of objects.
The design of Skyhook allows extending the client and storage layers of distributed object storage systems with widely-used data access libraries requiring minimal modifications. This is achieved by
- Creating a file system shim in the object storage layer so that access libraries embedded in the storage layer can continue to operate on files.
- Offloading selection, projection, and aggregation queries directly to objects using file system striping metadata.
- Mapping datasets to directories of files and files to logically self-contained objects by using standard file system striping.
On a high level, Skyhook has three architectural components:
- Storage-layer: On the storage-layer, Skyhook uses the Ceph Object Class SDK to embed functions inside the RADOS object store which uses Apache Arrow libraries to scan objects containing structured data in file formats like Parquet and Feather. A shim is used to provide a file-like view of objects to Arrow access libraries so that the Arrow APIs can be used out-of-the-box.
- Client-layer: On the client-layer, Skyhook is exposed as a SkyhookFileFormat API which is actually an extension of the Arrow Dataset FileFormat API in the sense that it allows offloading scan operations of file formats supported by Arrow. The SkyhookFileFormat API uses metadata provided by the CephFS layer to discover the objects that need to be scanned and launches scan operations on each of the objects directly, bypassing the POSIX layer.
- Protocol: This component of Skyhook includes scan request definitions and request/response serialization/deserialization functions that the SkyhookFileFormat uses internally to communicate with the storage backend.
Skyhook can be built along with Apache Arrow by using the ARROW_SKYHOOK=ON CMake flag. Doing this builds both the Ceph CLS extensions and the Dataset API extensions. Two different Skyhook-specific shared libraries are generated:
- libarrow_skyhook_client.so: When using the SkyhookFileFormat API in a standalone code, the program needs to be linked with the arrow_dataset, arrow, and arrow_skyhook_client shared libraries during compiling.
- libcls_skyhook.so: For Skyhook to work with Ceph, this shared library needs to be copied to a particular path (/usr/lib/rados-classes) in every Ceph storage node for the Ceph OSDs to pick up and load the object class extensions.
Now that you understand the basics of SkyhookDM, you can follow this guide to deploy a Skyhook cluster and run a very simple query. The Skyhook source code can be found here. For more details on Skyhook, you can go through our Arxiv paper here (more publications on the way). For any Skyhook-related questions, feel free to reach out to us at the Apache Arrow community. We will be more than happy to help. Thanks and Cheers!