Rationale

This page describes some of the ideas that led to the development of the Piqi project and influenced various aspects of its design.

Piqi is designed around the following set of concepts.

  1. Universal schema language (Piqi)

    Having a data definition language is important because it makes it possible to implement data validation, data mapping to static programming languages, reliable schema evolution, and other features such as query and data manipulation languages.

    Being universal means that data definitions can be mapped to a variety of data formats and target languages while offering full control over the details of such mappings.

    This, in turn, means that the data model needs to be generic yet flexible. It also means that the schema language has to be extensible, so that new mappings can be supported without changing the language itself.

  2. Portable data formats

    Portable data formats provide a layer for exchanging and persisting structured data across different systems. Such systems can be distributed in space and time and written using different languages.

  3. Mapping to programming languages

    It is important to have a straightforward and reliable way to deal with structured data from various programming languages.

  4. Human-friendly data format (Piq)

    A language that allows humans to conveniently read, write and edit structured data in text format (a short sketch of Piq data follows this list).

  5. Tools

    A set of tools for type-based data validation, conversion, schema language manipulation, building custom mappers and so on.
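
To make item 4 concrete, here is a sketch of what a small piece of Piq data might look like; the person type and its field names are invented for illustration:

    % a comment; a top-level object is labeled with its type name
    :person [
        .name "J. Random Hacker"
        .age 42
    ]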

The Piqi project is centered around Piqi, a universal schema language. Instead of reinventing everything, it binds together established tools such as JSON, XML and Protocol Buffers and provides a high level of compatibility and interoperability between them. Overall, it offers users a choice of which data format to use, and it does so without sacrificing the flexibility of the data model or of the underlying mappings between formats and programming languages.
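
For a taste of the schema language itself, here is a minimal sketch of a Piqi record definition matching the Piq data shown earlier; the type and field names are hypothetical:

    % a record with one required and one optional field
    .record [
        .name person

        .field [
            .name name
            .type string
        ]

        % fields are required by default; this one is marked optional
        .field [
            .name age
            .type int
            .optional
        ]
    ]

The same definition can then drive type-based data validation and conversion between the JSON, XML, Protocol Buffers and Piq representations of the data.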

Here are a few examples to illustrate these points.

XML wasn’t originally designed to deal with structured data or with this problem domain as a whole; it was adapted to and established in that role later. As a result, it makes only a half-hearted attempt at covering (1) and (4), and it generally fails at (2) and (3): it provides neither a compact binary encoding nor a direct mapping to programming languages. Overall, XML is awkward and bloated enough that, in many cases, people turn to custom data formats or to more efficient existing alternatives such as JSON or Google Protocol Buffers. (To be fair, XML does come with some useful companion technologies, XPath for example, but they can’t fix the fundamental design problem of using a markup language to represent data.) James Clark, who led the development of XML, summarized many of these concerns in his blog post "XML vs the Web".

JSON is a simple human-readable encoding for structured data. It maps very naturally to dynamic programming languages such as JavaScript, and in this way it partially covers (3) and (4). However, without a proper schema language, JSON can’t be reliably mapped to static programming languages. Also, because it is a text format, JSON data can’t be processed and stored as efficiently as data in a binary encoding.
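
To illustrate the mapping problem: a JSON object such as {"id": 1, "score": 2.5} doesn’t say, on its own, whether id should become a 32-bit or a 64-bit integer in a static language. A hypothetical Piqi definition resolves such questions explicitly:

    % pins down exact static types for the fields of the JSON object above
    .record [
        .name sample
        .field [ .name id .type int64 ]
        .field [ .name score .type float ]
    ]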

Google Protocol Buffers is a great cross-language portable data serialization tool. It provides mappings for C++, Java, Python and many other languages, and it offers an extensible data model, a schema language and a binary encoding. In this way it covers (1), (2) and (3), and it even has a text encoding for (4), although a rather rudimentary one. Although it is a very powerful and well-designed tool, it has some limitations. For instance, the data model doesn’t have tagged unions (also known as variants). Also, the schema language is not very extensible, which makes it harder to build mappings for target languages and to adapt it to new use cases.
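
Tagged unions, by contrast, are first-class in Piqi. Here is a sketch of a Piqi variant definition; the type and option names are invented for illustration:

    % a tagged union: each value is exactly one of the options below
    .variant [
        .name shape

        % a circle is represented by its radius
        .option [ .name circle .type float ]

        % a square by the length of its side
        .option [ .name square .type float ]

        % a constant option that carries no value
        .option [ .name point ]
    ]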

There are some other examples.

ASN.1 is a family of standards originating from the old telecom industry. While it does provide solutions for some of these problems, this "designed by committee" monster has never seen serious adoption in the broader software world and can be considered legacy technology.

Many high-level programming languages offer built-in mechanisms for serializing language data structures. Such systems are typically not portable and not applicable outside the scope of the language they were designed for.

Middleware systems such as CORBA, ZeroC Ice and Apache Thrift are another example. They use a common data encoding and a data definition language (as part of an interface definition language), and they provide language mappings. In this way they cover (1), (2) and (3). However, they usually don’t address these concerns well, as they try to cover too many things at once and focus mostly on the communication side, which is a completely separate problem domain (one that includes distributed communication models, service registration, load balancing, failover, authorization, etc.).

More information about various serialization formats and technologies can be found on these Wikipedia pages:

http://en.wikipedia.org/wiki/Category:Data_serialization_formats

http://en.wikipedia.org/wiki/Comparison_of_data_serialization_formats