Protocol Buffers (protobuf) Knowledge

Kip Landergren

March 27, 2022 (Updated: February 07, 2024)

My Protocol Buffers (protobuf) knowledge base that evolves as I learn more.

Overview
- Core Idea
- Key Concepts
Components
Versioning
Usage
Protocol Buffers Terminology

Overview

Protocol Buffers (AKA “protobuf”) is a binary serialization toolset and language composed of:

a lightweight schema definition language / interface definition language (IDL) for creating a language-neutral representation of data structures and services stored within .proto files
a compiler, protoc, which consumes .proto files and generates native bindings—client and server libraries—for creating, serializing, deserializing, and working with instances of those data structures and services
a binary output format

An overview of the protocol buffer compilation process, starting with the data schema defined in a .proto file, combined with user options and passed to protoc, the compiler, to generate native bindings.

This centralizes the data schema definition and makes it straightforward for a server written in one language to communicate with multiple clients, each written in a different language. Serialized data can also be written to disk or used for inter-process communication.

Common data format demonstrated by two different client libraries communicating with each other.

Core Idea

Create a toolset for:

defining a data schema
defining a service-based API
safe evolution—through backward and forward compatibility guarantees—of that schema and API
automatic generation of client and server libraries
a common serialization format to allow interchange between those generated libraries

Taken together, this creates a single source of truth for data and service definitions, replacing per-language implementations that previously drifted out of sync easily.

Key Concepts


interface definition language / schema definition language: the language used to specify the data structure to be represented
serialization / deserialization: the process of converting a data structure into a format that can be easily transmitted, stored, and reconstructed later
services and remote procedure call (RPC): the specification of operations between a server and client that can be performed using the data structures
binary encoding / output format: the binary format the data is output as
code generation: the automatically generated native bindings based on the schema definition that represent the data structures and the services that use them
data exchange formats, compatibility, and versioning: how a set of servers can keep communicating as their formats change, preserving existing functionality while safely rolling out new functionality

Components

The Language

proto2

Language Guide (proto 2)

proto3

Language Guide (proto 3)

`protoc` and related plug-ins

protoc uses a plug-in architecture for code generation. It contains native support for some language bindings and is otherwise augmented by separate plug-ins. These plug-ins support code generation for both data objects and remote procedure call (RPC) frameworks like gRPC.
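To make the plug-in mechanism a little more concrete, here is a minimal sketch that only builds the protoc command line; it does not run the compiler. The helper `protoc_command`, the file `person.proto`, the output directory `gen`, and the plug-in named `example` are all hypothetical. The flags themselves (`--proto_path`, `--NAME_out`, and `--plugin=protoc-gen-NAME=PATH`) are standard protoc options: a built-in generator is selected with `--NAME_out`, and for any other NAME protoc looks for an executable called `protoc-gen-NAME`.

```python
import shlex


def protoc_command(proto_file, out_dir, generator="python",
                   plugin_path=None, proto_path="."):
    """Build (but do not run) a protoc invocation as an argument list.

    `generator` is the NAME in --NAME_out. When `plugin_path` is given,
    protoc is told explicitly where the protoc-gen-NAME executable lives
    instead of searching PATH for it.
    """
    args = ["protoc", f"--proto_path={proto_path}"]
    if plugin_path is not None:
        args.append(f"--plugin=protoc-gen-{generator}={plugin_path}")
    args.append(f"--{generator}_out={out_dir}")
    args.append(proto_file)
    return args


# Built-in generator: no plug-in flag is needed.
builtin = protoc_command("person.proto", "gen")
print(shlex.join(builtin))

# Hypothetical third-party plug-in located at an explicit path.
external = protoc_command("person.proto", "gen", generator="example",
                          plugin_path="./bin/protoc-gen-example")
print(shlex.join(external))
```

The resulting list could be handed to `subprocess.run` on a machine where protoc is installed; that step is left out here so the sketch has no external dependencies.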

The Wire Format

Encoding
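As a rough illustration of the wire format, the sketch below hand-encodes a single integer field in plain Python, without the protobuf library. The helper names (`encode_varint`, `decode_varint`, `encode_varint_field`) are made up for this example. The rules it follows come from the encoding guide: integers are written as base-128 varints, and each field is preceded by a tag computed as the field number shifted left by three bits, combined with the wire type. It handles only non-negative integers and the varint wire type.

```python
WIRE_VARINT = 0  # wire type used for int32, int64, bool, enum, ...


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative integers")
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # more bytes follow: set continuation bit
        else:
            out.append(low7)  # last byte: continuation bit clear
            return bytes(out)


def decode_varint(data: bytes, pos: int = 0) -> tuple[int, int]:
    """Decode a varint starting at `pos`; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift  # low 7 bits carry the payload
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode_varint_field(field_number: int, value: int) -> bytes:
    """Encode one varint field as tag followed by payload."""
    tag = (field_number << 3) | WIRE_VARINT
    return encode_varint(tag) + encode_varint(value)


encoded = encode_varint_field(1, 150)
print(encoded.hex())  # 089601
```

Encoding the value 150 in field number 1 yields the three bytes 08 96 01, which matches the worked example in the encoding guide.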

Versioning

Protocol Buffers has to manage evolving:

protoc and the protocol buffers implementation
the generated code, through both the embedded language generators and third-party compiler plug-ins
the schema language and syntax
the wire format

The considerations are that:

wire format changes are prohibitively expensive in both cost and effort
the protocol buffers authors want to improve the implementation without breaking or being coupled to plug-ins
the plug-in authors want to improve their language-specific generators independent of implementation development
users want a way to evolve their .proto files, and the generated code, without breaking their clients or making sweeping changes to their code base
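One reason schemas can evolve without breaking clients is that every field on the wire carries its number and wire type, so a reader can step over fields it does not recognize. The sketch below imitates that behavior in plain Python. The function names and the `id`, `name`, and `verified` fields are hypothetical, and the parser is a simplified stand-in, not the real library: it reads only top-level fields and silently drops unknown ones, whereas real implementations generally retain them.

```python
def read_varint(data: bytes, pos: int) -> tuple[int, int]:
    """Decode a base-128 varint at `pos`; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def parse_known_fields(data: bytes, known: dict[int, str]) -> dict[str, object]:
    """Parse top-level fields, keeping those listed in `known`, skipping the rest."""
    message = {}
    pos = 0
    while pos < len(data):
        tag, pos = read_varint(data, pos)
        field_number, wire_type = tag >> 3, tag & 0x07
        if wire_type == 0:  # varint
            value, pos = read_varint(data, pos)
        elif wire_type == 1:  # fixed 64-bit
            value, pos = data[pos:pos + 8], pos + 8
        elif wire_type == 2:  # length-delimited (strings, bytes, sub-messages)
            length, pos = read_varint(data, pos)
            value, pos = data[pos:pos + length], pos + length
        elif wire_type == 5:  # fixed 32-bit
            value, pos = data[pos:pos + 4], pos + 4
        else:
            raise ValueError(f"unsupported wire type {wire_type}")
        name = known.get(field_number)
        if name is not None:  # unknown field numbers are skipped, not errors
            message[name] = value
    return message


# A newer writer emits fields 1 (id), 2 (name), and a newly added field 3.
new_payload = b"\x08\x96\x01" + b"\x12\x07testing" + b"\x18\x01"
# An older writer only knows about fields 1 and 2.
old_payload = b"\x08\x96\x01" + b"\x12\x07testing"

OLD_SCHEMA = {1: "id", 2: "name"}
NEW_SCHEMA = {1: "id", 2: "name", 3: "verified"}

# Forward compatibility: old reader, new data. Field 3 is ignored.
print(parse_known_fields(new_payload, OLD_SCHEMA))  # {'id': 150, 'name': b'testing'}
# Backward compatibility: new reader, old data. Field 3 is simply absent.
print(parse_known_fields(old_payload, NEW_SCHEMA))  # {'id': 150, 'name': b'testing'}
```

The scheme only works if field numbers keep their meaning over time, which is why reusing or renumbering existing fields is the kind of change that breaks clients.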

As of January 2024, the two supported versions of the IDL are proto2 and proto3. The goals of proto3 were to:

simplify the language
ensure forward compatibility
improve JSON compatibility
support default values
clear up semantics for field types
add field presence checks

While this was useful, there were still problems:

no migration tool between the two was offered
plug-ins were still coupled to the implementation
generated code had no mechanism to gracefully evolve

To address these, protocol buffers will soon adopt the concept of editions. Editions, inspired by how Rust uses them, are groups of features that allow the user to opt in to compiler behavior. A future user will be able to evolve a .proto file at their own pace by specifying an edition (the first of which will essentially be a no-op) and then opting in to features as desired.

Usage

Protocol Buffers is set up to give you the means to describe your data, enforce its safe evolution, and automatically generate polyglot client and server libraries that can communicate with each other. It is opinionated about some factors—like default values and optional fields—but largely leaves you the work of defining and managing your API.

Strengths:

automatically generated bindings
forward and backward compatibility
extensibility
compact binary output format
fast parsing

Considerations:

output format is not human readable
not well suited to structured text that people edit by hand (e.g. a user can open an XML file and edit it directly; with the protocol buffers binary format this is not possible)

Best for:

inter-service communication

Protocol Buffers Terminology

protoc: the protocol buffers compiler
native bindings: the automatically generated software libraries
wire format: the protocol buffers binary output format