
First steps with Apache Beam in Java: tutorial

In this Apache Beam tutorial, EPAM’s Software Engineer Erick Romero shares his take on how to get started with this unified model in Java.

Published in Tech matters · 11 October 2022 · 2 min read

Introduction

In this brief Apache Beam tutorial, you will learn how to get started with it using Java and try a quick example on your own.

All you need is:

  • Basic knowledge of Java
  • Your favorite IDE (I recommend IntelliJ IDEA)
  • The Maven build tool, to create the project and manage both dependencies and the build lifecycle

What is Apache Beam?

Apache Beam is a unified model for defining both batch and streaming data-parallel processing pipelines; typical workloads include ETL (Extract, Transform, Load) as well as general batch and stream data processing.

Common Apache Beam use cases:

  • Creating powerful batch and stream processing of big data
  • Processing billions of events per day in real time
  • Reducing stream and batch processing costs
  • Creating robust and scalable machine learning infrastructure to allow integrations with hundreds of applications

Apache Beam has some important concepts to know, described below:

Pipeline encapsulates the entire data processing task: a defined set of operations applied to the data, from reading the input to writing the output.
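To make this concrete, here is a minimal sketch of a complete pipeline (an illustrative example with made-up input, not code from the tutorial repository; it assumes the default Direct Runner and standard Beam Java SDK transforms):

// Imports: org.apache.beam.sdk.Pipeline, org.apache.beam.sdk.transforms.Create,
// org.apache.beam.sdk.transforms.MapElements, org.apache.beam.sdk.values.TypeDescriptors
var pipeline = Pipeline.create();                         // uses the Direct Runner by default
pipeline.apply("Read input", Create.of("hello", "beam"))  // creates a PCollection<String>
        .apply("To upper case", MapElements
                .into(TypeDescriptors.strings())
                .via((String s) -> s.toUpperCase()));     // a PTransform over that PCollection
pipeline.run();                                           // the runner executes the pipeline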

SDK is the set of language-specific libraries you use to create pipelines, build PCollections, and apply transforms. Apache Beam provides SDKs for Java, Python, Go, and other languages.

Runner is the most important component because it is what actually executes the pipeline. We can choose from numerous runners, such as Apache Flink, Apache Spark, Google Cloud Dataflow, and others. For our example, we’ll use the Direct Runner to test the pipeline locally.

PTransform is a data processing operation, such as ParDo or Combine, that is applied to one or more PCollections and produces new ones.
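For example, a built-in combining transform can be applied like this (an illustrative sketch only, assuming an existing pipeline object; Sum comes from the Beam Java SDK and is not used in the tutorial code):

// Imports: org.apache.beam.sdk.transforms.Create, org.apache.beam.sdk.transforms.Sum
var numbers = pipeline.apply(Create.of(1, 2, 3, 4));       // PCollection<Integer>
var total = numbers.apply("Sum", Sum.integersGlobally());  // PCollection with a single element: 10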

PCollection is the dataset, bounded (batch) or unbounded (streaming), that flows through the pipeline and that the transforms operate on.

Aggregation is used when we need to compute a result from multiple input elements, typically by grouping them by a common key, for example with GroupByKey or Combine.perKey.
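A typical aggregation looks like this (again, an illustrative sketch with made-up data, assuming an existing pipeline object; it is not part of the tutorial code):

// Imports: org.apache.beam.sdk.transforms.Create, org.apache.beam.sdk.transforms.GroupByKey,
// org.apache.beam.sdk.values.KV
var sales = pipeline.apply(Create.of(
        KV.of("Colombia", 10), KV.of("Bolivia", 5), KV.of("Colombia", 7)));
var salesByCountry = sales.apply(GroupByKey.<String, Integer>create());
// salesByCountry is a PCollection<KV<String, Iterable<Integer>>>,
// e.g. ("Colombia", [10, 7]) and ("Bolivia", [5])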

Get started with Apache Beam step by step

Now that you know the basics, it’s time to try a real example using Apache Beam in Java. We’re going to filter a list of countries by two static letters and then merge the results. Let’s get started.

1. First, download this repository with all of the required dependencies to save some time.

2. Open the repository, move to the example branch, go to the PipelineDemo class, and add the following code to create a pipeline:

var pipeline = Pipeline.create();

3. Import the Pipeline class from org.apache.beam.sdk. Now you’ve created a Pipeline object.
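For reference, the corresponding import is:

import org.apache.beam.sdk.Pipeline;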

4. We can start creating PCollections to apply some transforms to, as follows:

var countriesCollection = pipeline.apply("Reading from list", 
Create.of("Colombia", "Francia", "Estados unidos", "Bolivia"));

Next, import PCollection from org.apache.beam.sdk.values and Create from org.apache.beam.sdk.transforms.
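In code, these imports are:

import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.values.PCollection;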

5. Now, it’s time to apply some transforms to the pipeline, so we’ll use a ParDo transform to process the data and extract the countries that start with the letters C and B.

var countriesBeginWithC = countriesCollection
        .apply("Filtering By C", ParDo.of(new DoFn<String, String>() {

    @ProcessElement
    public void processElement(@Element String elem, ProcessContext c) {
        if (elem.startsWith("C")) {
            c.output(elem);
        }
    }
}));

var countriesBeginWithB = countriesCollection
        .apply("Filtering By B", ParDo.of(new DoFn<String, String>() {

    @ProcessElement
    public void processElement(@Element String elem, ProcessContext c) {
        if (elem.startsWith("B")) {
            c.output(elem);
        }
    }
}));

6. The previous step extracts the countries that start with each letter and creates two PCollections. Now we’re going to join the two outputs and apply Flatten.pCollections() to flatten them into the final merged list:

var mergedCollectionWithFlatten = PCollectionList.of(countriesBeginWithC)
        .and(countriesBeginWithB)
        .apply(Flatten.pCollections());

7. Now, apply the TextIO.write() transform to write the final result to a plain text file:

mergedCollectionWithFlatten.apply(TextIO.write().to("mergepcollections/extractwords").withoutSharding());

8. Finally, run the pipeline. Since we haven’t specified any runner, Beam uses the Direct Runner, which executes the pipeline locally:

pipeline.run();
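If you later want to target a different runner, the common pattern is to build the pipeline from command-line options instead of the defaults (a generic sketch, not code from this repository; FlinkRunner is only an example and requires the corresponding runner dependency):

// Import: org.apache.beam.sdk.options.PipelineOptionsFactory
// Assumes this code runs inside main(String[] args)
var options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
var pipeline = Pipeline.create(options);
// ... apply the same transforms as above ...
pipeline.run().waitUntilFinish();  // e.g. pass --runner=FlinkRunner on the command line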

Because we used withoutSharding(), the pipeline writes a single output file at mergepcollections/extractwords, containing the countries that start with C and B: Colombia and Bolivia.

That’s it! We’ve completed our quick Apache Beam tutorial, and now you know the basic steps you need to get started with it using Java. If you want to see more examples, feel free to explore the main branch.

Happy coding!
