To fulfil its main goal of advancing the state of the art in distributed, fault-tolerant, and adaptive data processing, the Stratosphere project covers a broad range of research topics. These range from fundamental concerns such as fault tolerance, through flexible programming models for distributed computations, to concrete use cases on top of distributed systems such as focused crawling. This page gives an overview of the different subprojects of Stratosphere.

Distributed Data Profiling

Research associate: Sebastian Kruse, Information Systems Group, HPI

Data profiling is the task of extracting metadata from databases. It is an important prerequisite for many common data management tasks, such as data integration, schema redesign, and data cleansing. While simple types of metadata, such as the minimum and maximum value of a data attribute, are relatively easy to calculate, it is incomparably harder to obtain more complex metadata such as functional dependencies. In fact, many dependency discovery problems are subject to an exponentially growing search space, which leads to extremely time- and space-demanding calculations or even to infeasibility. In addition, real-world datasets keep growing. We therefore want to leverage the opportunities of distributed computation platforms, e.g., resource scalability, to enable data profiling even on very large datasets. However, in contrast to previous research, our algorithms need to be designed for scale-out, which brings up many new challenges, e.g., the lack of shared state and the handling of complex control flow.
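To make the contrast between simple and complex metadata concrete, the following is a minimal single-machine sketch in Python. It is not the project's algorithm: the function names, the toy rows, and the brute-force enumeration are assumptions chosen purely for illustration. It computes the minimum and maximum of an attribute in one pass, checks a single functional dependency, and then counts how many candidate dependencies a naive search would have to test.

```python
from itertools import combinations


def column_min_max(rows, attribute):
    """Simple metadata: a single pass over one attribute."""
    values = [row[attribute] for row in rows]
    return min(values), max(values)


def holds_fd(rows, lhs, rhs):
    """Return True if the functional dependency lhs -> rhs holds in rows."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        value = row[rhs]
        # A violation occurs when the same LHS values map to two RHS values.
        if seen.setdefault(key, value) != value:
            return False
    return True


def discover_fds(rows, attributes):
    """Brute force: test every non-empty LHS subset for every RHS attribute."""
    found = []
    for rhs in attributes:
        others = [a for a in attributes if a != rhs]
        for size in range(1, len(others) + 1):
            for lhs in combinations(others, size):
                if holds_fd(rows, lhs, rhs):
                    found.append((lhs, rhs))
    return found


def candidate_count(n):
    """Number of candidates the brute-force search tests for n attributes."""
    # n choices for the RHS, each with 2^(n-1) - 1 non-empty LHS subsets.
    return n * (2 ** (n - 1) - 1)


# Hypothetical toy relation, used only for illustration.
rows = [
    {"zip": "14482", "city": "Potsdam", "age": 31},
    {"zip": "14482", "city": "Potsdam", "age": 45},
    {"zip": "10115", "city": "Berlin", "age": 31},
]
```

With these toy rows, `holds_fd(rows, ("zip",), "city")` is true while `holds_fd(rows, ("age",), "city")` is false. The candidate count grows from 9 for three attributes to more than ten million for twenty, which illustrates why such a naive search quickly becomes infeasible on wide tables.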

Query Language Meteor (completed)

Research associate: Arvid Heise, Information Systems Group, HPI

Meteor is an operator-oriented, extensible query language that uses a JSON-like data model to support applications that analyze semi-structured and unstructured data. Users may load new operators from operator packages, which are provided by package developers and specifically targeted at certain application areas. The operators from different packages may be freely combined, enabling users to write complex analytical queries.

Data Cleansing and Integration Operators (completed)

Research associate: Arvid Heise, Information Systems Group, HPI

With the data cleansing and integration package, users may declaratively specify complex workflows that involve ad hoc data cleansing and integration. We showcase the new operators in the large Open Government Integration project GovWILD, which integrates three official US data sources and the community-curated Freebase into one consistent data source about spending, together with the sponsors and recipients.