hair is important: March 2011

There are two ways to serialize data in Avro: using code generation and not using code generation. I want to show examples of each way because I didn't find many examples online when I needed to do it. First, I will briefly cover my understanding of the general concept of each method for serialization.

When using code generation, you use the Avro tool binary, external to your application (or programatically internal to your application) to generate classes that represent the data in your schema. You populate the fields in the classes and use a specific datum writer parameterized with the class to produce the serialized data.

When you do not use code generation, you use the GenericData classes along with the schema to do a series of put() operations. You then use a generic datum writer constructed with the schema to produce the serialized data.

Example Avro Schema

This is a simple schema to use in the example. I made it have an array as a record field because I had a bit of trouble with that when I was working with such a schema. Let's say that this schema is in a file called Schema.avsc.

{
  "namespace": "my.pkg.path.avro",
  "type" : "record",
  "name" : "PeopleList",
  "fields" : [
    {"name" : "Version", "type" : "int"},
    {"name" : "People", "type" : {
      "namespace" : "my.pkg.path.avro",
      "type" : "array",
      "items" : {
        "type" : "record",
        "namespace": "my.pkg.path.avro",
        "name" : "Person",
        "fields" : [
            {"name" : "FirstName", "type" : "string"},
            {"name" : "LastName", "type" : "string"}
        ]
      }}
    }
  ]
}

This schema represents data consisting of a version and a list of records describing people. Each record has the person's first and last name.

Serialize Using Code Generation

To generate classes from outside of your program using the tool, you can invoke it like this:

java -cp <needed jars> org.apache.avro.tool.Main compile schema
    <path to avro schema> <path to put generated classes>

For the <needed jars> it will probably be a handful of dependencies. If a needed dependency can't be found when you run the tool, you'll see an error that the specific jar cannot be found. When this happened, I added it to the classpath (the -cp option) and once I had all of them, it worked.

The generator will start at the destination path specified as the second argument and create directories as specified by the namespaces in your schema. Note that you must have namespaces otherwise an exception will be thrown (NullException) and the code generation will fail. In this example, the tool will generate classes and put them in:

<path to put generated classes>/my/pkg/path/avro/.

In this example, two classes will be generated: PeopleList and Person. All of the classes contain i.e. store the schema used to generate them.

Here is a link to the Avro 1.6.1 API to refer to when looking at the code: http://avro.apache.org/docs/1.6.1/api/java/index.html.

This code snippet shows how to use the generated classes to create the serialized data and place it in a java.nio.ByteBuffer. I create a stream, a binary encoder initialized with the stream, and a specific datum writer. Then, I create a PeopleList, add three people to it, serialize it, and store it in the ByteBuffer.

ByteArrayOutputStream out = new ByteArrayOutputStream();
Encoder e = new BinaryEncoder(out);
SpecificDatumWriter<PeopleList> w = 
        new SpecificDatumWriter<PeopleList>(PeopleList.class);

PeopleList all = new PeopleList();
all.People = new GenericData.Array<Person>(
        3, all.getSchema().getField("People").schema());

Person person1 = new Person();
person1.FirstName = new Utf8("Cairne");
person1.LastName = new Utf8("Bloodhoof");
all.People.add(person1);

Person person2 = new Person();
person2.FirstName = new Utf8("Sylvanas");
person2.LastName = new Utf8("Windrunner");
all.People.add(person2);

Person person3 = new Person();
person3.FirstName = new Utf8("Grom");
person3.LastName = new Utf8("Hellscream");
all.People.add(person3);

all.Version = 1;
w.write(all, e);
e.flush();
ByteBuffer serialized = ByteBuffer.allocate(out.toByteArray().length);
serialized.put(out.toByteArray());

Serialize Without Using Code Generation

This code snippet shows how to populate GenericData objects, serialize them, and place the result in a java.nio.ByteBuffer. This is similar to the previous code. I create a stream, a binary encoder initialized with the stream, and a generic datum writer. Then, I create a GenericData.Record that will have the version and the list. Next, I create a GenericData.Array for the list, add three people (each being a GenericData.Record) to it, serialize it, and store it in the ByteBuffer. Each GenericData has to be initialized with a schema that describes its format. You can grab different sections of the overall schema using Schema methods. Check them out here: http://avro.apache.org/docs/current/api/java/org/apache/avro/Schema.html. Keep in mind as you're doing this, each put() gets validated against the schema, so if you mess up, an exception will be thrown. If you leave out any field i.e. you don't do a put() for it, an exception will be thrown when you try to write it.

Schema schema = Schema.parse(new File("Schema.avsc"));

ByteArrayOutputStream out = new ByteArrayOutputStream();
Encoder e = new BinaryEncoder(out);
GenericDatumWriter<GenericRecord> w = new GenericDatumWriter<GenericRecord>(schema);

GenericRecord all = new GenericData.Record(schema);
Schema peopleSchema = schema.getField("People").schema();
GenericArray<GenericRecord> people = new GenericData.Array<GenericRecord>(3, peopleSchema);
Schema personSchema = peopleSchema.getElementType();

GenericRecord person1 = new GenericData.Record(personSchema);
person1.put("FirstName", new Utf8("Cairne"));
person1.put("LastName", new Utf8("Bloodhoof"));
people.add(person1);

GenericRecord person2 = new GenericData.Record(personSchema);
person2.put("FirstName", new Utf8("Sylvanas"));
person2.put("LastName", new Utf8("Windrunner"));
people.add(person2);

GenericRecord person3 = new GenericData.Record(personSchema);
person3.put("FirstName", new Utf8("Grom"));
person3.put("LastName", new Utf8("Hellscream"));
people.add(person3);

all.put("People", people);
all.put("Version", 1);
w.write(all, e);
e.flush();
ByteBuffer serialized = ByteBuffer.allocate(out.toByteArray().length);
serialized.put(out.toByteArray());

Thoughts

I tried both methods of serializing my data and found that using the code generation seemed less error prone and also handy for when I made changes to the schema. I use Eclipse with the Maven Integration Plugin (http://maven.apache.org/eclipse-plugin.html) to build. I wrote a bash script to invoke the code generation tool and used the Maven AntRun Plugin (http://maven.apache.org/plugins/maven-antrun-plugin/) to run it as part of my build. That way I could make changes to my schema, do a project clean, and have updated classes generated easily. I think updating the serialization methods is easier using the generated classes instead of changing/adding/removing the GenericData objects and/or their put() operations.

I hope maybe some of this can help someone with their Avro serialization endeavors in Java.

hair is important

Thursday, March 10, 2011

Using C Format Code Macros in C++

Friday, March 04, 2011

Avro Serialization in Java

Example Avro Schema

Serialize Using Code Generation

Serialize Without Using Code Generation

Thoughts