Rationale
This page describes some of the ideas that led to the development of the project and influenced various aspects of its design.
1. How Piqi Compares to Other Languages and Tools
Piqi is designed around the following set of concepts.
(1) Data definition language (Piqi)
Having a data definition language is important because it makes it possible to implement data validation, data mapping to static programming languages, reliable schema evolution and other features such as query and data manipulation languages, formatting functions, etc.
(2) Portable binary data encoding
A portable data encoding with a simple format allows efficient implementations of parsers and generators and provides a portable layer for transferring and persisting data.
(3) Mapping to programming languages
It is important to have a straightforward and reliable way to deal with structured data from various programming languages.
(4) Human-friendly data encoding (Piq)
A language that allows humans to conveniently read, write and edit structured data in text format.
(More specific information about how Piq compares to other notations can be found on the Piq documentation page.)
There are a lot of different systems and tools that address some of these problems with varying efficiency, and, apparently, there’s no technology that provides a solution for all of them.
The Piqi project covers all the above concepts and attempts to address them as efficiently as possible.
I’ll give a couple of examples to illustrate the points. Note that this is a very high-level conceptual view; it is not meant to be comprehensive.
XML wasn’t originally designed to deal with structured data and the problem domain as a whole; it was adapted and established in this role later. As a result it attempts to cover (1) and (4), and generally fails at (2) and (3): it provides neither a compact binary encoding nor a direct mapping to programming languages. XML as a whole is so awkward and bloated that many engineers prefer custom data formats or more efficient existing alternatives such as JSON or Google Protocol Buffers. (XML does have some extra parts that make it more useful, such as XPath, but they can’t solve the fundamental design issue of using a markup language to represent data.) James Clark, who led the development of XML, summarised many of these concerns in his blog post XML vs the Web.
JSON is a simple text-based encoding for structured data. It is very efficient when mapped to dynamic programming languages such as JavaScript, and it is human-readable. This way it partially covers (3) and (4). However, without a proper schema language it can’t be reliably mapped to static programming languages. Also, because it is a text format, JSON data cannot be processed and stored as efficiently as data encoded in a binary format.
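The cost of a text encoding can be illustrated with a small Python sketch. It does not use Piqi's or Protocol Buffers' wire format: the record, the `RECORD_FORMAT` layout and the `decode` helper are assumptions made up for this illustration, and the fixed `struct` layout merely stands in for "a binary encoding whose structure is described by a schema".

```python
import json
import struct

# A hypothetical record used only for illustration.
record = {"id": 123456, "temperature": 21.5, "active": True}

# Text encoding: self-describing, so field names and digits are spelled out.
text_bytes = json.dumps(record).encode("utf-8")

# Binary encoding: the layout lives in a schema (here, the format string),
# so only the values are stored: little-endian uint32 + float64 + bool.
RECORD_FORMAT = "<Id?"
binary_bytes = struct.pack(
    RECORD_FORMAT, record["id"], record["temperature"], record["active"]
)

def decode(data: bytes) -> dict:
    """Decode a packed record back into a dict using the same layout."""
    rec_id, temperature, active = struct.unpack(RECORD_FORMAT, data)
    return {"id": rec_id, "temperature": temperature, "active": active}

print(len(text_bytes), len(binary_bytes))
```

The binary form is smaller and can be decoded without tokenizing text, but it is unreadable without the layout, which is one reason a data definition language and a binary encoding tend to go together.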
Google Protocol Buffers is a great cross-language portable data serialization solution. It provides mappings for C++, Java, Python and some other languages. It offers an extensible data model, a data definition language and a binary encoding. This way it covers (1), (2) and (3). It even has a text encoding, which would cover (4), but that encoding is rudimentary.
There are some other examples.
For instance, a lot of high-level programming languages have built-in mechanisms and custom libraries for serializing language data structures. But these systems are typically not applicable outside the language they were designed for.
Middleware systems such as CORBA, ZeroC ICE and Apache Thrift are another example. They use a common data encoding and a data definition language (as a part of an interface definition language), and they provide language mappings. This way they cover (1), (2) and (3). However, they don’t usually address these concepts efficiently, as they try to cover too many things at once and usually focus more on the communication part, which is a separate problem domain [1].
More information about various serialization formats and technologies can be found on these Wikipedia pages:
http://en.wikipedia.org/wiki/Category:Data_serialization_formats
http://en.wikipedia.org/wiki/Comparison_of_data_serialization_formats
2. Piqi and Programming Languages
The combination of the Piq and Piqi languages can be viewed as a subset of a programming language. They provide data definition and data representation languages but don’t have operational semantics.
The Piq language’s concepts and syntax elements are inspired by programming languages. Furthermore, Piq is specifically designed to allow future extensions: currently it implements only a data representation language, but it can be extended to support macros, a data query language, a general execution model, etc. Read more about it on the project’s Roadmap page.
The Piqi data model is to a large extent inspired by high-level programming languages such as OCaml. In a certain sense Piqi may be viewed as an extension of the ML data model.
For some reason programming language researchers and designers don’t pay much attention to data models. It is hard to find any new practical data models implemented by programming languages since the development of ML [2]. In reality the situation is usually even worse: it is easy to find a modern static programming language that doesn’t support variant types and pattern matching, or a dynamic language that doesn’t support atoms, or a language that encourages hiding data structures behind "objects" and "classes".
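For readers unfamiliar with variant types, the following Python sketch shows roughly what is meant. Python has no native variant types, so the sketch emulates one with dataclasses and explicit type checks; the names `Circle`, `Rect`, `Shape` and `area` are hypothetical and not part of Piqi.

```python
import math
from dataclasses import dataclass
from typing import Union

@dataclass(frozen=True)
class Circle:
    radius: float

@dataclass(frozen=True)
class Rect:
    width: float
    height: float

# The "variant": a Shape value is exactly one of the listed alternatives.
Shape = Union[Circle, Rect]

def area(shape: Shape) -> float:
    """Dispatch on the alternative, as ML-style pattern matching would."""
    if isinstance(shape, Circle):
        return math.pi * shape.radius ** 2
    if isinstance(shape, Rect):
        return shape.width * shape.height
    # An ML compiler would check exhaustiveness statically; here it is a runtime error.
    raise TypeError(f"not a Shape: {shape!r}")
```

In a language with native variants and pattern matching, the data stays visible as plain data and the compiler can warn when a case is forgotten, which this emulation can only detect at run time.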
This is quite surprising, because most practical programming effort today is spent on transforming structured data, given that all the basic algorithms are already implemented and available in libraries [3]. The remainder, perhaps 5% or even less, is spent on implementing advanced algorithms and complicated computation models such as concurrent, distributed or parallel ones.
My prediction is that one of the major paths of programming language evolution, the one that has the potential to tremendously increase programmers’ productivity, will be related to how a programmer manipulates structured data. And the ability to efficiently manipulate structured data is likely to be based on some high-level data model implemented natively in a programming language.
For example, a few lines of SQL can do a lot more with relational data than any feature provided by popular programming languages. Higher-order functions, list comprehensions, dictionaries and iterators do not come even close to what SQL could do for a programmer if a limited semantic subset of SQL were built into a programming language [4].
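As a rough illustration of the difference, the sketch below computes the same result twice: once with a single declarative SQL query (via Python's standard `sqlite3` module) and once with hand-written loops and comprehensions. The `orders` data and the function names are invented for this example.

```python
import sqlite3
from collections import defaultdict

# Hypothetical rows: (customer, category, amount).
orders = [
    ("alice", "books", 30.0),
    ("bob", "books", 12.5),
    ("alice", "music", 20.0),
    ("bob", "music", 7.5),
    ("carol", "books", 40.0),
]

def totals_sql(rows):
    """Group, filter and sort in one declarative statement."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (customer TEXT, category TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    result = conn.execute(
        "SELECT customer, SUM(amount) AS total FROM orders "
        "GROUP BY customer HAVING SUM(amount) > 25 ORDER BY total DESC"
    ).fetchall()
    conn.close()
    return result

def totals_python(rows):
    """The same logic spelled out step by step with general-purpose features."""
    totals = defaultdict(float)
    for customer, _category, amount in rows:
        totals[customer] += amount
    filtered = [(c, t) for c, t in totals.items() if t > 25]
    return sorted(filtered, key=lambda pair: pair[1], reverse=True)
```

Both functions return the same list here, but the SQL version states what is wanted while the Python version states how to compute it; the gap widens quickly once joins and multiple aggregates are involved.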
Of course, the relational data model is not a complete answer. Nested data structures are in many cases more natural and useful. This is the area where pattern matching and path expressions can make programmers substantially more productive.
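A minimal sketch of what a path expression over nested data might look like is given below. The dotted-path syntax, the `get_path` helper and the sample document are assumptions made for illustration; they are not a Piqi or Piq feature.

```python
def get_path(data, path, default=None):
    """Follow a dotted path such as 'user.addresses.1.city' through nested dicts and lists."""
    current = data
    for key in path.split("."):
        if isinstance(current, dict) and key in current:
            current = current[key]
        elif isinstance(current, list) and key.isdecimal() and int(key) < len(current):
            current = current[int(key)]
        else:
            # Any missing step yields the default instead of raising.
            return default
    return current

# A hypothetical nested document.
doc = {
    "user": {
        "name": "Ann",
        "addresses": [
            {"city": "Kyiv", "primary": True},
            {"city": "Oslo", "primary": False},
        ],
    }
}

second_city = get_path(doc, "user.addresses.1.city")
missing = get_path(doc, "user.phones.0", default="n/a")
```

Without such a construct, every access to a deeply nested field needs its own chain of existence checks, which is the kind of boilerplate a language-level data model could remove.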
Notes
[1] Middleware focuses on areas such as communication channels, distributed communication model, service registration, load balancing, failover, authorization, etc.
[2] There are some notable examples, but they are all domain-specific languages, and the last two of them are designed to work on top of XML, which is a really bad starting point. The first two were designed by Google.
[3] This may not be obvious, because a lot of programming languages and techniques make data transformations look less like data transformations and more like a mess. For low-level programming languages, though, that is just the way it is.
[4] Microsoft’s LINQ seems to be successfully following this idea.
