Stratosphere

Big Data looks tiny from here.


Stable: 0.4, Beta: 0.5-SNAPSHOT

Stratosphere is the next generation Big Data Analytics Platform.

It combines the strengths of MapReduce/Hadoop with powerful programming abstractions in Java and Scala and a high-performance runtime. Stratosphere has native support for iterations, incremental iterations, and programs consisting of large DAGs of operations.

Easy to Install

Download and run Stratosphere programs in less than 5 minutes.

Easy to Use

Beauty of Scala programming: specify what you want out of the data, not how the job is executed.

Advanced Analytics

Iterative, arbitrarily large programs with multiple inputs and outputs.

Run in the Cloud

Instantly deploy Stratosphere on Amazon's EC2 and run your data analysis in the cloud.

Performance

Scale out to large clusters, exploit multi-core processors and in-memory processing.

Empowering Data Scientists

Our optimizer automatically parallelizes and optimizes your programs.


Features

More Operators

Stratosphere extends the well-known MapReduce model with new operators. These operators represent many common data analysis tasks more naturally and efficiently. All operators will start working in memory and gracefully go out of core under memory pressure.
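
As a rough illustration of what operators beyond Map and Reduce can express, the sketch below uses plain Scala collections rather than the Stratosphere API. Picking join and co-group as the examples is an assumption made here, and the Visit and User record types are hypothetical.

```scala
object OperatorSketch {
  // Hypothetical record types, used only for this illustration.
  final case class Visit(userId: Int, url: String)
  final case class User(userId: Int, name: String)

  // Equi-join: pair every visit with each user that has the same key.
  def join(visits: Seq[Visit], users: Seq[User]): Seq[(String, String)] = {
    val usersById = users.groupBy(_.userId)
    for {
      v <- visits
      u <- usersById.getOrElse(v.userId, Seq.empty)
    } yield (u.name, v.url)
  }

  // Co-group: for each key, collect both groups together,
  // even when one side has no records for that key.
  def coGroup(visits: Seq[Visit], users: Seq[User]): Map[Int, (Seq[String], Seq[String])] = {
    val visitsById = visits.groupBy(_.userId)
    val usersById = users.groupBy(_.userId)
    (visitsById.keySet ++ usersById.keySet).map { id =>
      val names = usersById.getOrElse(id, Seq.empty).map(_.name)
      val urls = visitsById.getOrElse(id, Seq.empty).map(_.url)
      (id, (names, urls))
    }.toMap
  }

  def main(args: Array[String]): Unit = {
    val visits = Seq(Visit(1, "/a"), Visit(1, "/b"), Visit(3, "/c"))
    val users = Seq(User(1, "ann"), User(2, "bob"))
    println(join(visits, users))
    println(coGroup(visits, users))
  }
}
```

Expressing the same join with only Map and Reduce would require tagging records by origin and separating them again inside the reducer, which is the kind of boilerplate a dedicated operator removes.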


Advanced Data Flow Graphs

Stratosphere lets you model analysis programs as advanced data flow graphs. For many applications, this is a more natural fit than the constrained MapReduce interface (strictly Map followed by Reduce). Furthermore, data pipelining and in-memory data transfers increase performance by drastically reducing disk and network I/O.


Complex Data Flows in Hadoop

Executing the plan shown on the left using MapReduce leads to a composition of multiple MapReduce jobs. Intermediate results are stored in HDFS after each job, which causes a lot of network and disk I/O. Remember also that just setting up a MapReduce job takes some time. This example shows that many real-world applications do not fit the MapReduce model. Implementing complex data flows using MapReduce is also very time-consuming.

Stratosphere is able to execute the job natively. Everything is processed in memory. Stratosphere starts using the local hard disks only when the data no longer fits into memory.


Powerful Programming Interfaces

You can write your parallel, distributed applications for Stratosphere in Java or Scala. The Java API will feel similar to Hadoop's MapReduce abstraction, but offers more functions and a more flexible data model. Scala is a functional, object-oriented programming language with powerful language features. Stratosphere's Scala API supports fluent and concise, yet efficient, analysis programs. Behind the scenes, Stratosphere uses code generation techniques to bridge between the Scala language and the runtime.

See our Quickstart guides for Scala and Java

Word count in Stratosphere using Scala:

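// Read the input text line by line
val input = TextFile(textInput)
// Split every line into words
val words = input.flatMap { line => line.split(" ") }
// Group identical words and count the members of each group
val counts = words
  .groupBy { word => word }
  .count()
// Write the counts as CSV
val output = counts.write(wordsOutput, CsvOutputFormat())
// Wrap the output in a plan that can be submitted for execution
val plan = new ScalaPlan(Seq(output))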
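
To check the logic of the word count without a Stratosphere installation, the same steps can be sketched with plain Scala collections. This is not the Stratosphere API: it runs on a single machine, and the `wordCount` helper and the filtering of empty tokens are additions made for this illustration.

```scala
object WordCountSketch {
  // Same steps as the program above: split lines, group equal words, count each group.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(line => line.split(" "))
      .filter(_.nonEmpty) // drop empty tokens produced by repeated spaces
      .groupBy(word => word)
      .map { case (word, occurrences) => (word, occurrences.size) }

  def main(args: Array[String]): Unit = {
    val counts = wordCount(Seq("to be or", "not to be"))
    counts.foreach { case (word, n) => println(s"$word,$n") }
  }
}
```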

Support for Iterative Algorithms

Data mining, machine learning, and graph processing algorithms often need to loop over the working data multiple times. Stratosphere supports iterative algorithms in its core. The runtime allows for very fast iteration times, and the optimizer takes care of caching loop-invariant data. The advanced incremental iterations support algorithms that focus on the "hot part" of the evolving solution and may converge even faster.


Iterations with Hadoop

Iterative algorithms are implemented in Hadoop MapReduce using a central driver that spawns MapReduce jobs until the result has been computed. This approach has many disadvantages:

  • Each MapReduce job needs at least 30 seconds for setup
  • Data is passed between jobs only through HDFS (in-memory transfer would be much faster)
  • All data has to be read again on every iteration, even the parts that have already converged
  • Very time-consuming to implement

Stratosphere natively executes iterative algorithms. The result of the last operator is fed back to the input of the first operator (in-memory). It is not required to start a new job on each iteration. Stratosphere detects which parts of the data need processing for further iterations. Only those are loaded.
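
The difference between re-processing everything and processing only the "hot part" can be sketched in plain Scala. The code below is not the Stratosphere API and runs on a single machine. `iterate` feeds each result back as the next input, and `connectedComponents` keeps a working set of vertices whose label changed in the previous round. The choice of connected components as the example and all names are assumptions made for this illustration.

```scala
object IterationSketch {
  // Bulk iteration: feed the result of `step` back as its next input until
  // the state stops changing or `maxIterations` rounds have run.
  @annotation.tailrec
  def iterate[S](state: S, maxIterations: Int)(step: S => S): S =
    if (maxIterations <= 0) state
    else {
      val next = step(state)
      if (next == state) state else iterate(next, maxIterations - 1)(step)
    }

  // Incremental iteration: label every vertex with the smallest vertex id
  // reachable from it. Only vertices that changed in the previous round
  // (the working set) send their label to their neighbours.
  def connectedComponents(edges: Seq[(Int, Int)]): Map[Int, Int] = {
    val neighbours: Map[Int, Seq[Int]] =
      (edges ++ edges.map(_.swap))
        .groupBy(_._1)
        .map { case (v, es) => (v, es.map(_._2)) }

    var labels: Map[Int, Int] = neighbours.keys.map(v => (v, v)).toMap
    var workset: Set[Int] = labels.keySet

    while (workset.nonEmpty) {
      // Smallest label offered to each neighbour of a changed vertex.
      val candidates: Map[Int, Int] = workset.toSeq
        .flatMap(v => neighbours(v).map(n => (n, labels(v))))
        .groupBy(_._1)
        .map { case (n, offers) => (n, offers.map(_._2).min) }

      // Keep only real improvements; these vertices form the next working set.
      val improved = candidates.filter { case (n, label) => label < labels(n) }
      labels = labels ++ improved
      workset = improved.keySet
    }
    labels
  }

  def main(args: Array[String]): Unit = {
    println(iterate(100, 50)(x => x / 2))
    println(connectedComponents(Seq((1, 2), (2, 3), (4, 5))))
  }
}
```

The working set shrinks as labels settle, so later rounds touch fewer vertices. A driver that launches a fresh job per round would instead re-read the whole graph each time.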


Built-In Optimizer

Stratosphere comes with an optimizer that is independent of the actual programming interface. It chooses a fitting execution strategy depending on the inputs and operations. For example, the "Join" operator will choose between partitioning and broadcasting the data, as well as between running a sort-merge join or a hybrid hash join algorithm.

  • Cost-based choice of operators and shipping strategies
  • In-memory pipelining of operators
  • Reduction of shipped and written data volume
  • Global memory distribution
  • Input Sampling to determine cardinalities
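
The idea of choosing a shipping strategy by cost can be sketched with a toy model. The code below is not Stratosphere's optimizer: the cost formula (bytes sent over the network), the strategy names, and the `choose` function are all assumptions made for this illustration, and a real optimizer would weigh more factors.

```scala
object JoinStrategySketch {
  sealed trait ShipStrategy
  case object BroadcastLeft extends ShipStrategy
  case object BroadcastRight extends ShipStrategy
  case object RepartitionBoth extends ShipStrategy

  // Toy cost model: estimated bytes sent over the network.
  // Broadcasting copies one input to every parallel instance;
  // repartitioning sends each input across the network once.
  def networkCost(strategy: ShipStrategy, leftBytes: Long, rightBytes: Long, parallelism: Int): Long =
    strategy match {
      case BroadcastLeft   => leftBytes * parallelism
      case BroadcastRight  => rightBytes * parallelism
      case RepartitionBoth => leftBytes + rightBytes
    }

  // Pick the strategy with the lowest estimated cost.
  def choose(leftBytes: Long, rightBytes: Long, parallelism: Int): ShipStrategy =
    Seq[ShipStrategy](RepartitionBoth, BroadcastLeft, BroadcastRight)
      .minBy(s => networkCost(s, leftBytes, rightBytes, parallelism))

  def main(args: Array[String]): Unit = {
    println(choose(leftBytes = 10L, rightBytes = 1000000L, parallelism = 8))
    println(choose(leftBytes = 1000000L, rightBytes = 1000000L, parallelism = 8))
  }
}
```

Under this model a very small input is cheaper to broadcast, while two inputs of similar size are cheaper to repartition. Estimates of input sizes are what make such a decision possible.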

Focus on your application logic rather than parallel execution.


System Stack

Stratosphere seamlessly integrates into existing Hadoop setups and runs side-by-side with Hadoop's TaskTrackers and DataNodes. Stratosphere can read data from Hadoop sources, but comes with its own efficient runtime. Similar to Hadoop, Stratosphere scales by adding more machines to the cluster.
Stratosphere also runs on Hadoop 2.2 (YARN), so you do not need to change your infrastructure.
The local execution mode lets you debug and analyze your application right from your favorite IDE, without having Stratosphere installed.


Open Source Community and Support

Stratosphere is an active, community-driven open-source project. It is licensed under the Apache License.
Our friendly community is always open to new users and developers. Join us and shape the future of Big Data.


Next Steps

Download and try Stratosphere. Our Quickstart scripts make it easy for developers to create an empty Stratosphere program skeleton to start from. Maven handles the dependencies, so no separate installation is needed. Testing and debugging are possible directly inside the IDE with Stratosphere's embedded mode. Ready-made binaries for cluster setups are available as well.

Visit bigdataclass.org for Stratosphere programming exercises.