Data profiling is the task of extracting metadata from databases. It is an important prerequisite for many common data management tasks such as data integration, schema redesign, and data cleansing. While simple types of metadata, such as the minimum and maximum value of an attribute, are easy to calculate, it is incomparably harder to obtain more complex metadata such as functional dependencies. In fact, many dependency discovery problems are subject to an exponentially growing search space, which leads to extremely time- and space-demanding calculations or even to infeasibility. In addition, real-world datasets keep growing. We therefore want to leverage the opportunities of distributed computation platforms, e.g., resource scalability, to enable data profiling even on very large datasets. However, in contrast to previous research, our algorithms need to be designed for scale-out, which brings up many new challenges, e.g., the lack of shared state and complex control flow handling.
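To make the gap between simple and complex metadata concrete, the following is a minimal single-machine sketch in Python. It is not one of the distributed algorithms discussed here; the function names (`holds`, `candidate_count`, `discover_fds`) and the sample rows are illustrative assumptions. Checking one functional dependency takes a single pass over the data, but the number of candidate dependencies grows exponentially with the number of attributes.

```python
from itertools import combinations


def holds(rows, lhs, rhs):
    """Return True if the functional dependency lhs -> rhs holds in rows."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        # Two rows that agree on lhs must also agree on rhs.
        if seen.setdefault(key, row[rhs]) != row[rhs]:
            return False
    return True


def candidate_count(num_attributes):
    """Count candidate FDs: each attribute as right-hand side, combined
    with any subset (including the empty one) of the other attributes."""
    return num_attributes * 2 ** (num_attributes - 1)


def discover_fds(rows, attributes):
    """Naively enumerate minimal FDs with a non-empty left-hand side."""
    found = []
    for rhs in attributes:
        others = [a for a in attributes if a != rhs]
        minimal = []
        for size in range(1, len(others) + 1):
            for lhs in combinations(others, size):
                # Skip supersets of an already valid left-hand side.
                if any(set(m) <= set(lhs) for m in minimal):
                    continue
                if holds(rows, lhs, rhs):
                    minimal.append(lhs)
                    found.append((lhs, rhs))
    return found


# Hypothetical sample data, for illustration only.
rows = [
    {"zip": "10115", "city": "Berlin", "name": "A"},
    {"zip": "10115", "city": "Berlin", "name": "B"},
    {"zip": "14482", "city": "Potsdam", "name": "C"},
]
print(discover_fds(rows, ["zip", "city", "name"]))
print(candidate_count(20))
```

Under these assumptions, a relation with 20 attributes already yields more than ten million candidates, each requiring a pass over the data in this naive form. The sketch also keeps all intermediate results in local memory, which is exactly the kind of shared state a scale-out design cannot rely on.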
Meteor is an operator-oriented, extensible query language that uses a JSON-like data model to support applications that analyze semi-structured and unstructured data. Users may load new operators from operator packages that target specific application areas and are provided by package developers. Operators from different packages may be freely combined, enabling users to write complex analytical queries.
With the data cleansing and integration package, users may declaratively specify complex workflows that involve ad-hoc data cleansing and integration. We showcase the new operators in the large Open Government Integration project GovWILD, which integrates three official US data sources and the community-curated Freebase into one consistent data source about spending, together with its sponsors and recipients.