Protocol Buffers (protobuf) Knowledge
Kip Landergren
My Protocol Buffers (protobuf) knowledge base that evolves as I learn more.
Overview
Protocol Buffers (AKA “protobuf”) are a binary serialization toolset and language comprised of:
- a lightweight schema definition language / interface definition language (IDL) for creating a language-neutral representation of data structures and services stored within .proto files
- a compiler, protoc, which consumes .proto files and generates native bindings (client and server libraries) for creating, serializing, deserializing, and working with instances of those data structures and services
- a binary output format
This centralizes data schema definition and makes it easy for a server written in one language to communicate with multiple clients, each written in a different language. Serialized data can also be written to disk or used for inter-process communication.
Core Idea
Create a toolset for:
- defining a data schema
- defining a service-based API
- safe evolution—through backward and forward compatibility guarantees—of that schema and API
- automatic generation of client and server libraries
- a common serialization format to allow interchange between those generated libraries
Taken together, this creates a single source of truth for data and service definitions, replacing per-language implementations that previously drifted out of sync easily.
Key Concepts
interface definition language / schema definition language | the language used to specify the data structure to be represented |
serialization / deserialization | the process of converting a data structure into a format that can be easily transmitted, stored, and reconstructed later |
services and remote procedure call (rpc) | the specification of operations between a server and client that can be performed using the data structures |
binary encoding / output format | the binary format the data is output as |
code generation | the automatically generated native bindings based on the schema definition that represent the data structures and the services that use them |
data exchange formats, compatibility, and versioning | how a set of servers can keep communicating as their formats change, preserving existing functionality while safely rolling out new functionality |
Components
The Language
proto2
proto3
protoc and related plug-ins
protoc uses a plug-in architecture for code generation. It contains native support for some language bindings and is otherwise augmented by separate plug-ins. These plug-ins support code generation for both data objects and remote procedure call (RPC) frameworks like gRPC.
The Wire Format
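As a rough illustration of what the wire format looks like, here is a minimal Python sketch that hand-encodes two fields without using any protobuf library. It relies on two facts about the encoding: integers are written as base-128 varints, and each field is prefixed by a key equal to (field_number << 3) | wire_type. The Person message, its field numbers, and all function names are hypothetical and exist only for this example; real code would use the bindings generated by protoc.

```python
WIRE_VARINT = 0  # wire type for varint-encoded integers
WIRE_LEN = 2     # wire type for length-delimited payloads (strings, bytes, nested messages)


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint, low 7 bits first."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative integers")
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low7)         # high bit clear: last byte
            return bytes(out)


def encode_key(field_number: int, wire_type: int) -> bytes:
    """Every field starts with a varint key: (field_number << 3) | wire_type."""
    return encode_varint((field_number << 3) | wire_type)


def encode_uint_field(field_number: int, value: int) -> bytes:
    return encode_key(field_number, WIRE_VARINT) + encode_varint(value)


def encode_string_field(field_number: int, text: str) -> bytes:
    payload = text.encode("utf-8")
    return encode_key(field_number, WIRE_LEN) + encode_varint(len(payload)) + payload


def encode_person(name: str, person_id: int) -> bytes:
    """Hypothetical schema: message Person { string name = 1; uint32 id = 2; }"""
    return encode_string_field(1, name) + encode_uint_field(2, person_id)


if __name__ == "__main__":
    # prints: 0a 03 41 64 61 10 96 01
    print(encode_person("Ada", 150).hex(" "))
```

The field names never appear in the output; only field numbers and values do. That keeps the format compact, and it also means the bytes cannot be interpreted without the schema.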
Versioning
Protocol Buffers has to manage evolving:
- protoc and the protocol buffers implementation
- the generated code, through both the embedded language generators and third-party compiler plug-ins
- the schema language and syntax
- the wire format
The considerations are that:
- wire format changes are prohibitively expensive in both cost and effort
- the protocol buffers authors want to improve the implementation without breaking or being coupled to plug-ins
- the plug-in authors want to improve their language-specific generators independent of implementation development
- users want a way to evolve their .proto files, and the generated code, without breaking their clients or doing expansive changes to their code base
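The last consideration depends on a property of the encoding: every field carries its own number and wire type, so a reader can step over fields it does not recognize. The following is a minimal Python sketch of that idea, not the actual generated code. The Person schema, the added email field, and the function names are all hypothetical. It shows an "old" reader handling bytes from a "newer" writer, and falling back to a default when a field is absent.

```python
def decode_varint(buf: bytes, pos: int):
    """Decode a base-128 varint starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:  # high bit clear: this was the last byte
            return result, pos
        shift += 7


def parse_fields(buf: bytes):
    """Yield (field_number, wire_type, value) for each field in buf."""
    pos = 0
    while pos < len(buf):
        key, pos = decode_varint(buf, pos)
        field_number, wire_type = key >> 3, key & 0x07
        if wire_type == 0:    # varint
            value, pos = decode_varint(buf, pos)
        elif wire_type == 1:  # fixed 64-bit
            value, pos = buf[pos:pos + 8], pos + 8
        elif wire_type == 2:  # length-delimited
            length, pos = decode_varint(buf, pos)
            value, pos = buf[pos:pos + length], pos + length
        elif wire_type == 5:  # fixed 32-bit
            value, pos = buf[pos:pos + 4], pos + 4
        else:
            raise ValueError(f"unsupported wire type {wire_type}")
        yield field_number, wire_type, value


def decode_person_v1(buf: bytes) -> dict:
    """Hypothetical 'old' reader that only knows name = 1 and id = 2."""
    person = {"name": "", "id": 0, "unknown_fields": []}  # defaults for absent fields
    for number, wire_type, value in parse_fields(buf):
        if number == 1 and wire_type == 2:
            person["name"] = value.decode("utf-8")
        elif number == 2 and wire_type == 0:
            person["id"] = value
        else:
            # A field added by a newer schema: note it and keep going.
            person["unknown_fields"].append(number)
    return person


# Bytes from a hypothetical newer writer: name="Ada", id=150, email (field 3)="a@b.c"
NEWER_MESSAGE = bytes.fromhex("0a03416461" "109601" "1a056140622e63")
# Bytes from an older writer that never set id
OLDER_MESSAGE = bytes.fromhex("0a03416461")

if __name__ == "__main__":
    print(decode_person_v1(NEWER_MESSAGE))
    print(decode_person_v1(OLDER_MESSAGE))
```

This only works if a field number keeps its meaning forever. The sketch assumes that field 3 was never used for anything else, which is why reusing or renumbering fields breaks compatibility.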
As of January 2024, the two supported versions of the IDL are proto2 and proto3. The goals of proto3 were to:
- simplify the language
- ensure forward compatibility
- improve JSON compatibility
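- standardize default values (explicit per-field defaults were removed in favor of fixed, type-specific zero values)
- clear up semantics for field types
- define field presence rules (singular scalar fields track presence only when declared with the optional keyword)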
While this was useful, there were still problems:
- no migration tool between the two was offered
- plug-ins were still coupled to the implementation
- generated code had no mechanism to gracefully evolve
To address these, protocol buffers will soon adopt the concept of editions. Editions, inspired by how Rust uses them, are groups of features that allow the user to opt in to compiler behavior. A future user will be able to evolve a .proto file at their own pace by specifying an edition (the first of which will essentially be a no-op) and then opting into features as desired.
Usage
Protocol Buffers is set up to give you the means to describe your data, enforce its safe evolution, and automatically generate polyglot client and server libraries that can communicate with each other. It is opinionated about some factors—like default values and optional fields—but largely leaves you the work of defining and managing your API.
Strengths:
- automatically generated bindings
- forward and backward compatibility
- extensibility
- compact binary output format
- fast parsing
Considerations:
- output format is not human readable
- not great for structured text (e.g. an XML file can usefully be edited by hand; a serialized protocol buffers message cannot)
Best for:
- inter-service communication
Protocol Buffers Terminology
protoc | the protocol buffers compiler |
native bindings | the automatically generated software libraries |
wire format | the protocol buffers binary output format |