Apache Avro: Data Serialization for Distributed Applications Training Course

Duration

14 hours (usually 2 days including breaks)

Requirements

  • A general familiarity with distributed computing.

Overview

Audience

  • Developers

Format of the Course

  • Lectures, hands-on practice, and short tests along the way to gauge understanding

Course Outline

Introduction

Principles of Distributed Computing

  • Apache Spark
  • Hadoop

Principles of Data Serialization

  • How a data object is passed over the network
  • Serialization of objects
  • Serialization approaches
    • Thrift
    • Protocol Buffers
    • Apache Avro
      • data structure
      • size, speed, format characteristics
      • persistent data storage
      • integration with dynamic languages
      • dynamic typing
      • schemas
        • untagged data
        • change management
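
To make the "untagged data" topic concrete, here is a minimal sketch of how Avro's binary encoding lays out a record: field values are written back to back in schema order, with no field names or tags in the payload. This is a hand-rolled illustration in plain Python, not the Avro library's API. The `User` schema and the helper names (`encode_long`, `encode_record`, and so on) are assumptions made for this example, and only the `string` and `long` types are handled.

```python
import json

# Hypothetical schema, used only for this illustration.
USER_SCHEMA = json.loads("""
{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "age",  "type": "long"}
  ]
}
""")


def encode_long(n: int) -> bytes:
    """Zigzag-encode a signed 64-bit integer, then write it as a varint."""
    z = (n << 1) ^ (n >> 63)          # zigzag: small magnitudes -> small codes
    out = bytearray()
    while z & ~0x7F:                  # more than 7 bits left
        out.append((z & 0x7F) | 0x80) # low 7 bits, continuation bit set
        z >>= 7
    out.append(z)
    return bytes(out)


def decode_long(buf: bytes, pos: int):
    """Read one zigzag varint from buf at pos; return (value, new_pos)."""
    shift = 0
    z = 0
    while True:
        b = buf[pos]
        pos += 1
        z |= (b & 0x7F) << shift
        if not b & 0x80:
            break
        shift += 7
    return (z >> 1) ^ -(z & 1), pos


def encode_record(schema: dict, datum: dict) -> bytes:
    """Concatenate field values in schema order; no names or tags are written."""
    out = bytearray()
    for field in schema["fields"]:
        value = datum[field["name"]]
        if field["type"] == "long":
            out += encode_long(value)
        elif field["type"] == "string":
            raw = value.encode("utf-8")
            out += encode_long(len(raw)) + raw   # length prefix, then bytes
        else:
            raise NotImplementedError(field["type"])
    return bytes(out)


def decode_record(schema: dict, buf: bytes) -> dict:
    """The reader needs the schema: the bytes alone carry no field names."""
    pos = 0
    datum = {}
    for field in schema["fields"]:
        if field["type"] == "long":
            datum[field["name"]], pos = decode_long(buf, pos)
        elif field["type"] == "string":
            length, pos = decode_long(buf, pos)
            datum[field["name"]] = buf[pos:pos + length].decode("utf-8")
            pos += length
        else:
            raise NotImplementedError(field["type"])
    return datum


encoded = encode_record(USER_SCHEMA, {"name": "Ada", "age": 36})
decoded = decode_record(USER_SCHEMA, encoded)
```

Here `encoded` is 5 bytes long: one length byte, the three bytes of "Ada", and one byte for the age. Because nothing in the payload identifies the fields, a reader can only interpret it with the writer's schema, which is why schemas and change management sit together in this part of the outline.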

Data Serialization and Distributed Computing

  • Avro's origins as a subproject of Hadoop
    • Java serialization
    • Hadoop serialization
    • Avro serialization

Using Avro with

  • Hive (AvroSerDe)
  • Pig (AvroStorage)

Porting Existing RPC Frameworks

Summary and Conclusion
