Friday, March 04, 2011

Avro Serialization in Java

There are two ways to serialize data in Avro: with code generation and without it. I want to show examples of each because I didn't find many online when I needed them. First, I will briefly cover my understanding of the general concept behind each method.

When using code generation, you run the Avro tool, either externally to your application or programmatically from within it, to generate classes that represent the data in your schema. You populate the fields of those classes and use a specific datum writer parameterized with the class to produce the serialized data.

When you do not use code generation, you use the GenericData classes along with the schema to do a series of put() operations. You then use a generic datum writer constructed with the schema to produce the serialized data.

Example Avro Schema


This is a simple schema to use in the example. I gave the record an array field because I had a bit of trouble with that when I was working with a similar schema. Let's say that this schema is in a file called Schema.avsc.

{
  "namespace": "my.pkg.path.avro",
  "type" : "record",
  "name" : "PeopleList",
  "fields" : [
    {"name" : "Version", "type" : "int"},
    {"name" : "People", "type" : {
      "namespace" : "my.pkg.path.avro",
      "type" : "array",
      "items" : {
        "type" : "record",
        "namespace": "my.pkg.path.avro",
        "name" : "Person",
        "fields" : [
            {"name" : "FirstName", "type" : "string"},
            {"name" : "LastName", "type" : "string"}
        ]
      }}
    }
  ]
}

This schema represents data consisting of a version and a list of records describing people. Each record has the person's first and last name.

Serialize Using Code Generation


To generate classes from outside of your program using the tool, you can invoke it like this:

java -cp <needed jars> org.apache.avro.tool.Main compile schema
    <path to avro schema> <path to put generated classes>

For <needed jars>, you will probably need a handful of dependencies. If one is missing when you run the tool, you'll see an error naming the class that could not be found. When this happened, I added the jar containing it to the classpath (the -cp option), and once I had all of them, it worked.

The generator starts at the destination path given as the second argument and creates directories matching the namespaces in your schema. Note that you must have namespaces; otherwise an exception (a NullPointerException) will be thrown and the code generation will fail. In this example, the tool will generate classes and put them in:

<path to put generated classes>/my/pkg/path/avro/.

In this example, two classes will be generated: PeopleList and Person. Each generated class stores the schema that was used to generate it.
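The namespace-to-directory mapping follows the usual Java package convention. As a rough illustration, here is a small sketch of where the generated source files should land. The helper class and the "generated" output directory are my own placeholders, not part of Avro.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of Avro: mirrors the package-to-directory
// convention that the generated sources follow.
final class GeneratedPathSketch {

    // Returns <outputDir>/<namespace with dots turned into directories>/<name>.java
    static Path sourceFileFor(String outputDir, String namespace, String name) {
        Path dir = Paths.get(outputDir, namespace.split("\\."));
        return dir.resolve(name + ".java");
    }

    public static void main(String[] args) {
        // "generated" is a placeholder for <path to put generated classes>.
        System.out.println(sourceFileFor("generated", "my.pkg.path.avro", "PeopleList"));
        System.out.println(sourceFileFor("generated", "my.pkg.path.avro", "Person"));
    }
}
```

If the generated directory is added as a source folder in your build, the classes can then be imported as my.pkg.path.avro.PeopleList and my.pkg.path.avro.Person.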

Here is a link to the Avro 1.6.1 API to refer to when looking at the code: http://avro.apache.org/docs/1.6.1/api/java/index.html.

This code snippet shows how to use the generated classes to create the serialized data and place it in a java.nio.ByteBuffer. I create a stream, a binary encoder initialized with the stream, and a specific datum writer. Then, I create a PeopleList, add three people to it, serialize it, and store it in the ByteBuffer.

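import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;

import org.apache.avro.generic.GenericData;
import org.apache.avro.io.Encoder;
import org.apache.avro.io.EncoderFactory;
import org.apache.avro.specific.SpecificDatumWriter;
import org.apache.avro.util.Utf8;

import my.pkg.path.avro.PeopleList;
import my.pkg.path.avro.Person;

ByteArrayOutputStream out = new ByteArrayOutputStream();
// BinaryEncoder is abstract in Avro 1.6.1, so obtain one from EncoderFactory.
Encoder e = EncoderFactory.get().binaryEncoder(out, null);
SpecificDatumWriter<PeopleList> w =
        new SpecificDatumWriter<PeopleList>(PeopleList.class);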

PeopleList all = new PeopleList();
all.People = new GenericData.Array<Person>(
        3, all.getSchema().getField("People").schema());

Person person1 = new Person();
person1.FirstName = new Utf8("Cairne");
person1.LastName = new Utf8("Bloodhoof");
all.People.add(person1);

Person person2 = new Person();
person2.FirstName = new Utf8("Sylvanas");
person2.LastName = new Utf8("Windrunner");
all.People.add(person2);

Person person3 = new Person();
person3.FirstName = new Utf8("Grom");
person3.LastName = new Utf8("Hellscream");
all.People.add(person3);

all.Version = 1;
w.write(all, e);
e.flush();
byte[] bytes = out.toByteArray();
ByteBuffer serialized = ByteBuffer.allocate(bytes.length);
serialized.put(bytes);
serialized.flip(); // reset position to 0 so the buffer can be read
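
One thing to watch with java.nio.ByteBuffer: allocate() followed by put() leaves the buffer's position at the end of the data, so a consumer that reads from the current position sees nothing until the buffer is flipped. ByteBuffer.wrap() starts at position zero instead. The sketch below uses only the JDK to show the difference. The class name and the byte values are arbitrary stand-ins for serialized Avro data.

```java
import java.nio.ByteBuffer;

// Plain JDK demonstration; the byte values are arbitrary stand-ins for
// serialized Avro data.
final class ByteBufferPositionSketch {

    // Copies the bytes into a new buffer and flips it so it is ready to read.
    static ByteBuffer readable(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.allocate(bytes.length);
        buf.put(bytes); // position is now bytes.length, so remaining() == 0
        buf.flip();     // limit = old position, position = 0
        return buf;
    }

    public static void main(String[] args) {
        byte[] bytes = {2, 6, 12};

        ByteBuffer unflipped = ByteBuffer.allocate(bytes.length);
        unflipped.put(bytes);
        System.out.println(unflipped.remaining());              // 0

        System.out.println(readable(bytes).remaining());        // 3
        System.out.println(ByteBuffer.wrap(bytes).remaining()); // 3
    }
}
```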

Serialize Without Using Code Generation


This code snippet shows how to populate GenericData objects, serialize them, and place the result in a java.nio.ByteBuffer. It is similar to the previous code: I create a stream, a binary encoder initialized with the stream, and a generic datum writer. Then I create a GenericData.Record that will hold the version and the list. Next, I create a GenericData.Array for the list, add three people (each a GenericData.Record) to it, serialize the top-level record, and store the result in the ByteBuffer.

Each GenericData object has to be initialized with the schema that describes its format. You can grab the relevant sections of the overall schema using Schema methods such as getField() and getElementType(). Check them out here: http://avro.apache.org/docs/current/api/java/org/apache/avro/Schema.html.

Keep in mind as you're doing this that put() looks up the field name in the schema, so if you mess up a name, an exception will be thrown. If you leave out any field, i.e. you never do a put() for it, an exception will be thrown when you try to write the record.

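import java.io.ByteArrayOutputStream;
import java.io.File;
import java.nio.ByteBuffer;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericArray;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.Encoder;
import org.apache.avro.io.EncoderFactory;
import org.apache.avro.util.Utf8;

// Schema.Parser replaces the older static Schema.parse(File).
Schema schema = new Schema.Parser().parse(new File("Schema.avsc"));

ByteArrayOutputStream out = new ByteArrayOutputStream();
// BinaryEncoder is abstract in Avro 1.6.1, so obtain one from EncoderFactory.
Encoder e = EncoderFactory.get().binaryEncoder(out, null);
GenericDatumWriter<GenericRecord> w = new GenericDatumWriter<GenericRecord>(schema);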

GenericRecord all = new GenericData.Record(schema);
Schema peopleSchema = schema.getField("People").schema();
GenericArray<GenericRecord> people = new GenericData.Array<GenericRecord>(3, peopleSchema);
Schema personSchema = peopleSchema.getElementType();

GenericRecord person1 = new GenericData.Record(personSchema);
person1.put("FirstName", new Utf8("Cairne"));
person1.put("LastName", new Utf8("Bloodhoof"));
people.add(person1);

GenericRecord person2 = new GenericData.Record(personSchema);
person2.put("FirstName", new Utf8("Sylvanas"));
person2.put("LastName", new Utf8("Windrunner"));
people.add(person2);

GenericRecord person3 = new GenericData.Record(personSchema);
person3.put("FirstName", new Utf8("Grom"));
person3.put("LastName", new Utf8("Hellscream"));
people.add(person3);

all.put("People", people);
all.put("Version", 1);
w.write(all, e);
e.flush();
byte[] bytes = out.toByteArray();
ByteBuffer serialized = ByteBuffer.allocate(bytes.length);
serialized.put(bytes);
serialized.flip(); // reset position to 0 so the buffer can be read
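
To get a feel for what ends up in the stream, here is a hand-rolled sketch of Avro's binary encoding for the example data, based on my reading of the Avro specification: ints and longs are zig-zag varints, strings are a length followed by UTF-8 bytes, a record is just its fields in order, and an array is a block of items preceded by a count and ended by a zero count. This is not the Avro library and not how it is implemented internally. The class and method names are made up for illustration, and I would treat the result as an approximation to compare against real output rather than a reference.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hand-rolled sketch of Avro's binary encoding for the PeopleList schema.
// This is NOT the Avro library; the class and method names are made up.
final class WireFormatSketch {

    // Avro ints and longs: zig-zag, then a base-128 varint, low bits first.
    static void writeLong(ByteArrayOutputStream out, long n) {
        long z = (n << 1) ^ (n >> 63);
        while ((z & ~0x7FL) != 0) {
            out.write((int) ((z & 0x7F) | 0x80));
            z >>>= 7;
        }
        out.write((int) z);
    }

    // Avro strings: the byte length as a long, then the UTF-8 bytes.
    static void writeString(ByteArrayOutputStream out, String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        writeLong(out, utf8.length);
        out.write(utf8, 0, utf8.length);
    }

    // names holds {FirstName, LastName} pairs.
    static byte[] encodePeopleList(int version, String[][] names) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeLong(out, version);            // field 1: Version
        if (names.length > 0) {
            writeLong(out, names.length);   // field 2: one array block, item count first
            for (String[] person : names) { // each Person is just its fields in order
                writeString(out, person[0]);
                writeString(out, person[1]);
            }
        }
        writeLong(out, 0);                  // a zero count ends the array
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] bytes = encodePeopleList(1, new String[][] {
            {"Cairne", "Bloodhoof"}, {"Sylvanas", "Windrunner"}, {"Grom", "Hellscream"}});
        System.out.println(bytes.length); // 56
    }
}
```

Note that nothing from the schema itself (field names, types) appears in the bytes, which is why the reader needs the schema to make sense of them.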

Thoughts


I tried both methods of serializing my data and found that code generation seemed less error-prone and was also handy when I made changes to the schema. I use Eclipse with the Maven Integration Plugin (http://maven.apache.org/eclipse-plugin.html) to build. I wrote a bash script to invoke the code generation tool and used the Maven AntRun Plugin (http://maven.apache.org/plugins/maven-antrun-plugin/) to run it as part of my build. That way I could make changes to my schema, do a project clean, and have updated classes generated easily. I think updating the serialization code is easier with the generated classes than changing, adding, or removing GenericData objects and their put() operations.

I hope some of this helps someone with their Avro serialization endeavors in Java.

22 comments:

  1. This comment has been removed by the author.

  2. Twas helpful. Thanks.

  3. Hello, from Tom White's Hadoop book I cannot get the following to work using Avro 1.6.1. When the JVM attempts creation of a new SpecificDatumWriter(Pair.class), I get a NullPointerException when Avro's SpecificData class tries to create a schema. In the creation of the schema, there is a replace method being called which causes this NullPointerException. Avro 1.6.1 deprecates DecoderFactory.defaultFactory() and changes BinaryEncoder to an Abstract class. Your help would be greatly appreciated. Thank you!

    See:
    // why doesn't this work in Avro 1.6.1.
    @Test
    public void testGeneratedPairClass() {
        Pair datum = new Pair();
        datum.left = new Utf8("L");
        datum.right = new Utf8("R");

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        Encoder encoder = new EncoderFactory().binaryEncoder(out, null);
        DatumWriter writer = new SpecificDatumWriter(Pair.class);

        try {
            writer.write(datum, encoder);
            encoder.flush();
            out.close();
        } catch (IOException e) {
            e.printStackTrace();
        }

        Decoder decoder = new DecoderFactory().binaryDecoder(out.toByteArray(), null);
        DatumReader reader = new SpecificDatumReader(Pair.class);
        Pair result;
        try {
            result = reader.read(null, decoder);
            assertThat(result.left.toString(), is("L"));
            assertThat(result.right.toString(), is("R"));
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    Replies
    1. The problem you're seeing is happening because the full name of your generated class doesn't match the name you gave in the schema AND you didn't specify a namespace in your schema. In my example schema, the name is "PeopleList". You can check your Pair class full name like this:

      Class c = (Class)Pair.class;
      System.out.println(c.getName());

      If you're curious, here's the code (http://svn.apache.org/viewvc/avro/tags/release-1.6.1/lang/java/avro/src/main/java/org/apache/avro/specific/SpecificData.java?revision=1201974&view=markup) where the exception gets thrown:

      } else if (type instanceof Class) { // class
        Class c = (Class)type;
        String fullName = c.getName();
        Schema schema = names.get(fullName);
        if (schema == null)
          try {
            schema = (Schema)(c.getDeclaredField("SCHEMA$").get(null));

            if (!fullName.equals(getClassName(schema)))
              // HACK: schema mismatches class. maven shade plugin? try replacing.
              schema = Schema.parse
                (schema.toString().replace(schema.getNamespace(),
                                           c.getPackage().getName()));
          } catch (NoSuchFieldException e) {
            throw new AvroRuntimeException(e);
          } catch (IllegalAccessException e) {
            throw new AvroRuntimeException(e);
          }

      It first checks if the full class name matches the name in the schema. If it doesn't, it tries to do a string replace to replace the namespace with the package name. Your namespace is probably null.
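
The failure mode described above can be reproduced with nothing but the JDK: String.replace() throws a NullPointerException when the text to search for is null. The sketch below only mimics the shape of that fallback line. The class name, the schema text, and the "my.pkg" package name are stand-ins of mine, and no Avro calls are involved.

```java
// Plain JDK illustration; the schema text and names are stand-ins, not Avro calls.
final class NullNamespaceSketch {

    // Mimics the shape of the fallback: swap the schema namespace for the package name.
    static String swapNamespace(String schemaJson, String namespace, String packageName) {
        return schemaJson.replace(namespace, packageName);
    }

    public static void main(String[] args) {
        String json = "{\"type\":\"record\",\"name\":\"Pair\",\"fields\":[]}";
        try {
            swapNamespace(json, null, "my.pkg"); // no namespace in the schema
        } catch (NullPointerException e) {
            System.out.println("NullPointerException from String.replace");
        }
    }
}
```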

      To fix the problem, either make the name in the schema match the class full name OR add a namespace to your schema. I recommend adding a namespace to the schema in any case because I've encountered problems in the past (http://stylishtoupee.blogspot.com/2011/02/null-exception-when-generating-classes.html) when I didn't specify a namespace in my schema. My example schema shows how to specify a namespace.

  4. Not only for Java: this clarifies some basic concepts of Avro. Great!

  5. Thank you for sharing and illustrating use of the built-in Avro toolset.

  6. Hi,
    I'm trying to convert a large XML file into Avro. I have the XSDs, and I'm able to convert them to an Avro schema and the corresponding Java classes using code generation. Is there a way to convert the XML to Java?
