Wednesday, September 17, 2014

PGObject Cookbook Part 2.1: Serialization and Deserialization of Numeric Fields

Preface


This article demonstrates the simplest cases regarding autoserialization and deserialization to the database of objects in PGObject.   It also demonstrates a minimal subset of the problems that three valued logic introduces and the most general solutions to those problems.  The next article in this series will address more specific solutions and more complex scenarios.

The Problems


Often times we want to have database fields automatically turned into object types which are useful to an application.  The example here turns SQL numeric fields into Perl Math::Bigfloat objects. However the transformation isn't perfect and if not carefully done can be lossy.  Most applications types don't support database nulls properly and therefore a NULL making a round trip may end up with an unexpected value if we aren't careful.  Therefore we have to create our type in a way which can make round trips in a proper, lossless way.

NULLs introduce another subtle problem with such mappings, in that object methods are usually not prepared to handle them properly.  One solution here is to try to follow the basic functional programming approach and copy on write.  This prevents a lot of problems.  Most Math::BigFloat operations do not mutate the objects so we are relatively safe there, but we still have to be careful.

The simplest way to address this is to build into one's approach a basic sensitivity into three value logic.  However, this poses a number of problems, in that one can accidentally assign a value which can have other values which can impact things elsewhere.

A key principle on all our types is that they should handle a null round trip properly for the data type, i.e. a null from the db should be turned into a null on database insert.  We generally allow programmers to check the types for nulls, but don't explicitly handle them with three value logic in the application (that's the programmer's job).

The Example Module and Repository


This article follows the code of PGObject::Type::BigFloat..  The code is licensed under the two-clause BSD license as is the rest of the PGObject framework.  You can read the code to see the boilerplate.  I won't be including it in here.  I will though note that this extends the Math::BigFloat library which provides arbitrary precision arithmetic for PostgreSQL and is a good match for LedgerSMB's numeric types.

NULL handling


To solve the problem of null inputs we extend the hashref slightly with a key _pgobject_undef and allow this to be set or checked by applications with a function "is_undef."  This is fairly trivial:

sub is_undef {
    my ($self, $set) = @_;
    $self->{_pgobject_undef} = $set if defined $set;
    return $self->{_pgobject_undef};
}

How PGObject Serializes


When a stored procedure is called, the mapper class calls PGObject::call_procedure with an enumerated set of arguments.  A query is generated to call the procedure, and each argument is checked for a "to_db" method.  That method, if it exists, is called and the output used instead of the argument provided.  This allows an object to specify how it is serialized.

The to_db method may return either a literal value or a hashref with two keys, type and value.  If the latter, the value is used as the value literal and the type is the cast type (i.e. it generates ?::type for the placeholder and binds the value to it).  This hash approach is automatically used when bytea arguments are found.

The code used by PGObject::Type::BigFloat is simple:

sub to_db {
    my $self = shift @_;
    return undef if $self->is_undef;
    return $self->bstr;
}

Any type of course can specify a to_db method for serialization purposes.

How and When PGObject Deserializes


Unlike serialization, deserialization from the database can't happen automatically without the developer specifying which database types correspond to which application classes, because multiple types could serialize into the same application classes.  We might even want different portions of an application (for example in a database migration tool) to handle these differently.

For this reason, PGObject has what is called a "type registry" which specifies which types are deserialized and as what.  The type registry is optionally segmented into several "registries" but most uses will in fact simply use the default registry and assume the whole application wants to use the same mappings.  If a registry is not specified the default subregistry is used and that is consistent throughout the framework.

Registering a type is fairly straight forward but mostly amounts to boilerplate code in both the type handler and using scripts.  For this type handler:

sub register{
    my $self = shift @_;
    croak "Can't pass reference to register \n".
          "Hint: use the class instead of the object" if ref $self;
    my %args = @_;
    my $registry = $args{registry};
    $registry ||= 'default';
    my $types = $args{types};
    $types = ['float4', 'float8', 'numeric'] unless defined $types and @$types;
    for my $type (@$types){
        my $ret =
            PGObject->register_type(registry => $registry, pg_type => $type,
                                  perl_class => $self);
        return $ret unless $ret;
    }
    return 1;
}

Then we can just call this in another script as:

PGObject::Type::BigFloat->register;

Or we can specify a subset of types or different types, or the like.

The deserialization logic is handled by a method called 'from_db' which takes in the database literal and returns the blessed object.  In this case:

sub from_db {
    my ($self, $value) = @_;
    my $obj = "$self"->new($value);
    $obj->is_undef(1) if ! defined $value;
    return $obj;
}

This supports subclassing, which is in fact the major use case.

Use Cases


This module is used as the database interface for numeric types in the LedgerSMB 1.5 codebase.  We subclass this module and add support for localized input and output (with different decimal and thousands separators).  This gives us a data type which can present itself to the user as one format and to the database as another.  The module could be further subclassed to make nulls contageous (which in this module they are not) and the like.

Caveats


PGObject::Type::BigFloat does not currently handle making the null handling contageous and this module as such probably never will, as this is part of our philosophy of handing control to the programmer.  Those who do want contageous nulls can override additional methods from Math::BigFloat to provide such in subclasses.

A single null can go from the db into the application and return to the db and be serialized as a null, but a running total of nulls will be saved in the db as a 0.  To this point, that behavior is probably correct.  More specific handling of nulls in the application, however, is passed to the developer which can check the is_undef method.

Next In Series:  Advanced Serialization and Deserialization:  Dates, Times, and JSON

Monday, September 15, 2014

PGObject Cookbook Part 1: Introduction

Preface


I have decided to put together a PGObject Cookbook, showing the power of this framework.  If anyone is interested in porting the db-looking sides to other languages, please let me know.  I would be glad to provide whatever help my time and skills allow.

The PGObject framework is a framework for integrated intelligent PostgreSQL databases into Perl applications.  It addresses some of the same problems as ORMs but does so in a very different way.  Some modules are almost ORM-like and more such modules are likely to be added in the future.  However unlike an ORM, PGObject mostly serves as an interface to stored procedures and whatever code generation routines will be added, these are not intended to be quickly changed.  Moreover it only supports PostgreSQL because we make extended use of PostgreSQL-only features.

For those who are clearly not interested in Perl, this series may still be interesting as it not only covers how to use the framework but also various problems that happen when we integrate databases with applications.  And there are people who should not use this framework because it is not the right tool for the job.  For example, if you are writing an application that must support many different database systems, you probably will get more out of an ORM than you will this framework.  But you still may get some interesting stuff from this series so feel free to enjoy it.

Along the way this will explore a lot of common problems that happen when writing database-centric applications and how these can be solved using the PGObject framework.  Other solutions of course exist and hopefully we can talk about these in the comments.

Much of the content here (outside of the prefaces) will go into a documentation module on CPAN.  However I expect it to also be of far more general interest since the problems are common problems across frameworks.

Introduction


PGObject is written under the theory that the database will be built as a server of information and only loosely tied to the application.  Therefore stored procedures should be able to add additional parameters without expecting that the application knows what to put there, so if the parameter can accept a null and provide the same answer as before, the application can be assured that the database is still usable.

The framework also includes a fairly large number of other capabilities.  As we work through we will go through the main areas of functionality one at a time, building on the simplest capabilities and moving onto the more advanced.  In general these capabilities can be grouped into basic, intermediate, and advanced:

Basic Functionality



  1. registered types, autoserialization, and autodeserialization.
  2. The simple stored procedure mapper
  3. Aggregates and ordering
  4. Declarative mapped methods

Intermediate Functionality


  1. The Bulk Loader
  2. The Composite Type stored procedure mapper
  3. The database admin functions

Advanced Functionality


  1. Memoization of Catalog Lookups
  2. Writing your own stored procedure mapper

This series will cover all the above functionality and likely more.  As we get through the series, I hope that it will start to make sense and we will start to get a lot more discussion (and hopefully use) surrounding the framework.

Design Principles


The PGObject framework came out of a few years of experience building and maintaining LedgerSMB 1.3.  In general we took what we liked and what seemed to work well and rewrote those things that didn't.  Our overall approach has been based on the following principles:
  • SQL-centric: Declarative, hand-coded SQL is usually more productive than application programming languages.  The system should leverage hand-coded SQL.
  • Leveraging Stored Procedures and Query Generators: The system should avoid having people generate SQL queries themselves as strings and executing them.  It's better to store them persistently in the db  or generate well-understood queries in general ways where necessary.
  • Flexible and Robust: It should be possible to extend a stored procedure's functionality (and arguments) without breaking existing applications.
  • DB-centric but Loosely Coupled:  The framework assumes that databases are the center of the environment, and that it is a self-contained service in its own right.  Applications need not be broken because the db structure changed, and the DB should be able to tell the application what inputs it expects.
  • Don't Make Unnecessary Decisions for the Developer:  Applications may use a framework in many atypical ways and we should support them.  This means that very often instead of assuming a single database connection, we instead provide hooks in the framework so the developer can decide how to approach this.  Consequently you can expect your application to have to slightly extend the framework to configure it.
This framework is likely to be very different from anything else you have used.  While it shares some similarities with iBatis in the Java world, it is unique in the sense that the SQL is stored in the database, not in config files.  And while it was originally inspired by a number of technologies (including both REST and SOAP/WSDL), it is very much unlike any other framework I have come across.

Next in Series:  Registered Types:  Autoserialization and Deserialization between Numeric and Math::BigFloat.

Sunday, September 14, 2014

LedgerSMB 1.4.0 Released


15 September 2014, London. The LedgerSMB project - all-volunteer developers and contributors - today announced LedgerSMB 1.4.0.

Based on an open source code base first released in 1999, the LedgerSMB project was formed in 2006 and saw it's 1.0 release in the same year. It has now seen continuous development for over eight years and that shows no signs of slowing down.

"LedgerSMB 1.4 brings major improvements that many businesses need," said Chris Travers, who helped found the project. "Businesses which do manufacturing or retail, or need features like funds accounting will certainly get much more out of this new release."

Better Productivity


LedgerSMB 1.4 features a redesigned contact management framework that allows businesses to better keep track of customers, vendors, employers, sales leads, and more. Contacts can be stored and categorized, and leads can be converted into sales accounts.

Additionally, a new import module has been included that allows businesses to upload csv text files to import financial transactions and much more. No longer is data entry something that needs to be done entirely by hand or involves customizing the software.

Many smaller enhancements are here as well, For example, shipping labels can now be printed for invoices and orders, user management workflows have been improved,

Better Reporting


The reporting interfaces have been rewritten in LedgerSMB 1.4.0 in order to provide greater flexibility in both reporting and in sharing reports. Almost all reports now include a variety of formatting options including PDF and CSV formats. Reports can also be easily shared within an organization using stable hyperlinks to reports. Additionally the inclusion of a reporting engine means that it is now relatively simple to write third-party reports which offer all these features. Such reports can easily integrate with LedgerSMB or be accessed via a third party web page.

Additionally, the new reporting units system provides a great deal more flexibility in tracking money and resources as they travel through the system. Not only can one track by project or department, but funds accounting and other specialized reporting needs are possible to meet.

Better Integration


Integration of third-party line of business applications is also something which continues to improve. While all integration is possible, owing to the open nature of the code and db structure, it has become easier as more logic is moved to where it can be easily discovered by applications.

There are two major improvement areas in 1.4. First additional critical information, particularly regarding manufacturing and cost of goods sold tracking, has been moved into the database where it can be easily shared by other applications. This also allows for better testability and support. Secondly LedgerSMB now offers a framework for web services, which are currently available for contact management purposes, allowing integrators to more easily connect programs together.

Commercial Options


LedgerSMB isn't just an open source project. A number of commercial companies offer support, hosting, and customization services for this ERP. A list of some of the most prominant commercial companies involved can be found at http://ledgersmb.org/topic/commercial-support

Thursday, September 11, 2014

Math and SQL Part 6: The Problem with NULLs

This will be the final installment on Math and SQL and will cover the problem with NULLs.  NULL handling is probably the most poorly thought-out feature of SQL and is inconsistent generally with the relational model.  Worse, a clear mathematical approach to NULLs is impossible with SQL because too many different meanings are attached to the same value.

Unfortunately, nulls are also indispensable because wider tables are more expressive than narrower tables.  This makes advice such as "don't allow nulls in your database" somewhat dangerous because one ends up having to add them back in fairly frequently.

At the same time understanding the problems that NULLs introduce is key to avoiding the worst of the problems and managing the rest.

Definition of a Null Set


A null set is simply a set with no members.  This brings us to the most obvious case of the use of a NULL, used when an outer join results in a row not being found.  This sort of use by itself doesn't do too much harm but the inherent semantic ambiguity of "what does that mean?" also means you can't just substitute join tables for nullable columns and solve the problems that NULLs bring into the database. This will hopefully become more clear below.

Null as Unknown


The first major problem surfaces when we ask the question, "when I do a left join and the row to the right is not found, does that mean we don't know the answer yet or that there is no value associated?"  In all cases, a missing result from an outer join will sometimes mean that the answer is not yet known, if only because we are still inserting the data in stages.  But it can also mean that maybe there is an answer and that there is no value associated.  In almost all databases, this may also be the case in this situation.

But then there is no additional harm done in allowing NULLs to represent unknowns in the tables themselves, right?

Handling NULLs as unknown values complicates database design and introduces problems so many experts like Chris Date tend to be generally against their use.  The problem is that using joins doesn't solve the problem but instead only creates additional failure cases to be aware of.  So very often times, people do use NULL in the database to mean unknown despite the problems.

NULL as unknown introduces problems to predicate logic because it introduces three value logic (true, false, and unknown), but these are typically only problems when one is storing a value (as opposed to a reference such as a key) in the table.  1 + NULL IS NULL.  NULL OR FALSE IS NULL.  NULL OR TRUE IS TRUE.  This makes things complicated.  But sometimes we must....

Null as Not Applicable


One severe antipattern that is frequently seen is the use of NULL to mean "Not Applicable" or "No Value."  There are a few data types which have no natural empty/no-op types.  Prime among these are numeric types.  Worse, Oracle treats NULL as the same value as an empty string for VARCHAR types.

Now, the obvious problem here is that the database does't know here that NULL is not unknown, and therefore you end up having to track this yourself, use COALESCE() functions to convert to sane values, etc.  In general, if you can avoid using NULL to mean "Not Applicable" you will find that worthwhile.

Now, if you have to do this, one strategy to make this manageable is to include other fields to tell you what the null means.  Consider for example:

CREATE TABLE wage_class (
   id int not null,
   label text not null
);

INSERT INTO wage_class VALUES(1, 'salary'), (2, 'hourly');

CREATE TABLE wage (
   ssn text not null,
   emp_id int not null,
   wage_class int not null references wage_class(id),
   hourly_wage numeric,
   salary numeric,
   check (wage_class = 1 or salary is null),
   check (wage_class = 2 or hourly_wage is null)
);

This approach allows us to select and handle logic based on the wage class and therefore we know based on the wage_class field whether hourly_wage is applicable or not.  This is far cleaner and allows for better handling in queries than just putting nulls in and expecting them to be semantically meaningful.  This solution can also be quite helpful because it ensures that one does not accidentally process an hourly wage as a salary or vice versa.

What Nulls Do to Predicate Logic


Because NULLs can represent unknowns, they introduce three-valued predicate logic.  This itself can be pretty nasty.  Consider the very subtle difference between:

   WHERE ssn like '1234%' AND salary < 50000

vs

    WHERE ssn like '1234%' AND salary < 50000 IS NOT FALSE

The latter will pull in hourly employees as well, as they have a NULL salary.

Nulls and Constraints


Despite all the problems, NULLs have become a bit of a necessary evil.  Constraints are a big part of the reason why.

Constraints are far simpler to maintain if they are self-contained in a tuple and therefore require no further table access to verify.  This means that wider tables admit to more expression relating to constraints than narrow tables.

In the example above, we can ensure that every hourly employee has no salary, and every salaried employee has no hourly wage.  This level of mutual exclusion would not be possible if we were to break off salaries and wages into separate, joined tables.

Nulls and Foreign Keys


Foreign keys are a special case of NULLs where the use is routine and poses no problems.  NULL always means "no record referenced" in this context and because of the specifics of three-valued boolean logic, they always drop out of join conditions.

NULLs in foreign keys make foreign key constraints and 5th Normal Form possible in many cases where it would not be otherwise.  Consequently they can be used routinely here with few if any ill effects.

What Nulls Should Have Looked Like:  NULL, NOVALUE, UNKNOWN


In retrospect, SQL would be cleaner if we could be more verbose about what we mean by a NULL.  UNKNOWN could then be reserved for rare cases where we really must need to store a record with incomplete data in it.  NULL could be returned from outer joins, and NOVALUE could be used for foreign keys and places where we know the field is not applicable.

Tuesday, August 19, 2014

Math and SQL Part 5: Projection and Selection

The SELECT statement is the workhorse of SQL.  Updates and inserts are necessary, but selects are where the power is.  One of the significant issues many people have in understanding these and using them is a clear understanding of the math involved, in part because there are a large number of implicit possibilities in the syntax.  In general folks learn to avoid the implicit aspects of the statement, but there are some implicit odditities that can't go away because they are baked into the language.  One of these is the confusion between selection and projection, and because SQL operates on bags instead of sets, neither of these work in SQL quite like they do in relational math.

In "Relational Theory and SQL," Chris Date says that tuples are unordered.  Tuples are strongly ordered, but what Date is getting at is that a relation has no natural ordering of columns, and therefore from any relation with a given ordering, a new relation can be created with a different ordering.  This process is called 'projection' and it is entirely implicit in SQL.  In the interest of being concise, in that book, Date does not stick with his typical approach of starting from clear definitions and discussing what these mean.

This being said, understanding projection, selection, and the difference between them makes understanding relational databases far easier in my view.

Because SQL operates on bags, I will use both SQL (for bag examples) and Set::Relation (for set examples) in order to discuss the math differences.  I will finally offer an idea of what a clearer English rendition of SQL's math semantics would be.

I:  Projection - Transform One Relation into Vertical Subset


Projection, represented by a Pi character (π) creates a set of ordered tuples with an ordering based on the operation.  What projection does, essentially, is take a subset (or the whole set) from each tuple, possibly re-ordering it in the process, in order to transform one relation into a derived relation.

In relational algebra, projection takes a set of tuples and returns a set of tuples with a subset of fields (possibly re-arranged).

For example, consider a relation R, where the tuple structure of R is (employee_id, first_name, last_name, badge_number, office_id, tax_id)


(employee_idfirst_namelast_nameoffice_idtax_id)
(1EricYoung1123-45-6789)
(2ThomasYoung1324-55-3334)
(3JulieJames1443-45-6234)
(4JaredYoung2533-44-5442)
(5ToddKals2532-44-2332)

Suppose the data looks like this:

Then we could write π(last_name, office_id)(R), we would get:
(last_nameoffice_id)
(Young1)
(James1)
(Young2)
(Kals2)

Note that there is one fewer row in the output table because in both cases, you have a well-formed set.

The equivalent of the previous expression in SQL would be:

SELECT DISTINCT last_name, office_id FROM R;

From these examples a few things should be fairly clear.  First tuples are strongly ordered and we give human-readable labels as a nicety, but that projection can be used to reorder the tuple elements arbitrarily. Consequently tuples are consistently ordered within a relation, but that order is largely arbitrary.  Consequently tuples cannot be said to have a natural order.  Instead they derive their structure and order from the relation they are a part of.

Unfortunately SQL makes all kinds of implicit tuple transformations either by implicit projection or by some sort of reverse operation (I will call that "overlay" for the moment).  Because projection is entirely implicit in SQL, it is never stated.  The SQL statement above in fact doesn't perform a selection operation but merely performs a projection.

II:  Selection - Turn a Relation into a Horizontal Subset


Selection takes a relation and provides a subset of it based on arbitrary criteria.  The structure of the relation is unchanged, but a subset of the correspondences are in the new relation.  In other words, it selects tuples for further processing, and since the relation is a set, it is guaranteed to give a subset out.  It is represented by the sigma character (σ) and the subscript gives the equality condition.

Given the same relation R, we can perform σoffice_id=1(R) and get:

(employee_idfirst_namelast_nameoffice_idtax_id)
(1EricYoung1123-45-6789)
(2ThomasYoung1324-55-3334)
(3JulieJames1443-45-6234)

Simple right?

In SQL terms, a plain selection is what is found in the where clause.  So the equivalent in SQL is simply:

SELECT * FROM R WHERE office_id = 1;

As you can see from these examples, selection and projection are transformed in SQL in some relatively non-intuitive ways.  At the same time, understanding relational databases demands a basic understanding of these two concepts.

Some Notes on Set::Relation


In the course of putting this together I spent a little time with Set::Relation.  I found that the module seems to behave relatively well.  The projection method combined with members() outputs a set.  This was as  far as I took it in my quick spin.  As per the discussion below, Set::Relation operates under the assumption that relations are well-formed sets instead of bags.  This can be very helpful in operating in a highly formal relational environment.

Selection could be done using the semijoin method, or using members() and grep.

Next:  On the Great Null Debate


As a sort of postscript, I have decided not to include strong definitions of joins in this.  The orthography doesn't lend itself well to HTML blogs, and the basic concepts are reasonably straight-forward from an SQL perspective (the exception perhaps being antijoins).  At any rate it is not worth the effort.

Thursday, August 14, 2014

Math and SQL part 4: Basic Data Types. Sets, Tuples, and Bags

A reasonable understanding of the relational model requires understanding the basic data types which make it up, both in the idealized model and in real-world applications.  This post discusses both the idealized model and the accommodations the standard implementations of it make to the messiness of the real world.  We won't deal with NULLs here.  That will be the subject of a future entry.

I Relation and Set


At its root, the relational model is about relations.  Consequently a relation needs to be understood on a mathematical level.  Since a relation is a form of set, let's explore sets first.

A set is an unordered list of unique items.  Thus, {a, b} is a set, but {a, b, b} is not (it is a multi-set or bag, however, see below).  Additionally since it is unordered, {a, b} = {b, a}.  We can project ordering onto a list for our convenience but this does not affect the underlying mathematics.  Sets can be finite or infinite.  In databases, we focus on finite sets (infinite sets can be calculated, but cannot be stored in set form on a finite filesystem).  Infinite sets are for math libraries and don't really concern us here.

A relation is a set of propositions, of correlating facts.   These are represented in tuples (see below). At this point, just recognize that a tuple represents a correlation between two or more facts.  The individual facts are also trivial functions of the tuple as a whole, because each tuple is necessarily unique, each fact is a function of the correlation.

Consider a point in the format of (x, y). For any given point, x and y are trivial functions of it.  For (2, 2), x=2 is trivially true, as is y=2.  This seems tautological (and on one level it is) but it is also important because it gets to the basic operation of a relational database.  For more on this, see part 2 of this series.

II Row and Tuple


A tuple is an ordered set of items.  Unlike a set the order is important and repetition is allowed.  (1, 2, 3), (1, 3, 3), and (1, 1, 1) are all valid tuples.  In tuples, the order ascribes a value to a meaning.  For example, if the preceding tuples referred to points in three dimensional Euclidean space, the first would correspond to the x axis, the second to the distance along the y axis, and the last to the distance along the z axis.   This means that points (1, 2, 3) and (2, 1, 3) are different, and not equal (if they were sets with the same members they would be equal).

While relations are open ended sets (which presumably can be added to indefinitely), tuples are closed-ended.  While one can add more members, this involves a semantic shift in the data.  For example, points (1, 2) and (1, 2, 3) share the same x and y axis but a two dimensional space is different from a three dimensional space semantically.

Relational algebra works on the set level, and perform set operations.  They do so by providing basic set operations across an entire set.  However the comparisons are done on tuples and their members. 

This point cannot be overemphasized.  Ideal relational databases operate on sets and tuples.  Everything else is secondary.  A well designed database will build on this basic point.  As a caution, SQL works on bags, not sets, so some care is needed to make this happen.

III Tuples, Sets, and Relations in Perl


Tuples and sets of course are generally useful in programming.  Understanding these is often a key to better performance, and in Perl this is somewhat counterintuitive.

A naive view from looking at the database would be to see a tuple as a %hash and a set as an @array.  In fact, it is the other way around.  Understanding this is helpful because if you really want a set, a hash is often a better tool than an array in Perl.

Arrays in Perl are ordered and allow for duplicates.  In this regard they are a tuple or a sequence.  Hashes on the other hand do not allow for duplicate keys and the keys are unordered.  Therefore, where you want a set, a hash is usually a better tool, and where you want a sequence or a tuple, an array is a better structure.

Suppose we want to determine which members of one sequence are not found in a specific set (currently represented in an array).  We might do something like:

my %sethash;
$sethash{$_} = 1 for @setarray;
my @excluded = grep {not exists $sethash{$_}} @sequencearray;

This will perform much better than:

my @excluded;
for my $element (@sequencearray){
    push @excluded, $element unless grep {$_ eq $element} @sequencearray;
}

In Perl, set operations are usually easier to make perform well when using hash tables instead of arrays for sets.  This is because hash tables are unordered and enforce uniqueness among their keys.  Thus they are better optimized for the sorts of access patterns set operations are likely to use.

IV Set and Bag


SQL for better or worse generally compromises on the set operation and uses multi-sets (or bags) instead.  A multiset is like a set but it allows duplicates.  This is a compromise to allow for real world data to be easily input into the database.  In import routines, this is important because data may not be perfectly unique, and this may require some processing during the import process to sanitize.

Unfortunately, set operations and bag operations are different and produce different results.  As a preview to the next in the series, we could project a series of operations and get duplicates.  For example, suppose we project y from y = x2.  In this approach we will get duplicate values of y, because when we square negative numbers we get the same result as when we square the positive numbers.  In a set operation, we would just discard the duplicates.  In a bag operation, we return the duplicates.

This distinction is key to understanding the difference between databases in the ideal world vs in the real world.

V A Note on Ordering


As you will note, sets are unordered.  Ordinality is not important.   We can order tuples thus turning them into a sequence (for purposes of limit and offset) but when we do so, we can still revert back to a set for further set operations.  Ordering is mathematically complicated in databases, but it is intuitively easy.  The key thing to remember is that ordering creates a sequence from a set, and that we have to create a set from the sequence before we do further set operations.

It is for this reason that the results of the following query will be in an order dependent on the query plan:

SELECT f.foo, b.bar
  FROM (select foo, id from foobar order by transdate desc limit 20) f
  JOIN barbaz b ON f.id = b.foo_id;

We have a subquery which generates a sequence, but it returns a set of the members of that sequence.  Those are then used in further set operations, and the order returned depends on how the planner decides to do those operations.

Next:  Projection and Selection

Monday, August 11, 2014

Math and SQL part 3: MVCC, Immutability, and Functional Programming

While first normal form is pretty much restricted to relational operations, this piece is more general.  It looks at the basic promise of functional programming, and the way that this applies to PostgreSQL.  In the mean time, we will also look at a fully functional approach to a key-value store in Perl as a close analogue.

I The Basic Promise of a Function


We should start by reminding ourselves of what a function represents: the promise that if you know one or more pieces of information, you can derive or look up another piece of information.

That promise is behind all functional programming, and in areas which are close to the math (such as relational databases) it is especially important.  This basic promise is important, but it also changes the way we have to think about programs.  Rather than computer programs doing things, they merely calculate things.  Whatever is done in the real world as a result is a byproduct of that calculation.

In general, a simple way to think of functional programming is whether output is solely dependent on input, but in relational math, this actually becomes a remarkably blurry line.  With data which can be looked up or stored, this line is often drawn at the point of mutability, because otherwise it is very difficult for routines to share data structures.  This means that data and state which cannot change during the function's existence does not count against it being a function.

Now, obviously a functional view of programming is pretty limited.  While a network server might listen to requests and return a function of input, side effects in the control flow become impossible in a purely function-oriented approach.  For this reason many functional programming environments I have worked with have imperative programs which link the calculations to real-world actions and side effects.

II How State Complicates Programming and Testing


Application state is a major source of bugs in any program.  Very often unforeseen circumstances arise because of slow changes the application's internal control data (called "state") undergoes.  These changes are cumulative, and if not carefully controlled, can lead to eventual inconsistencies which result in operational problems.

Testing a function is far easier than testing a mutable relation.  The number of data points to test are lower if you can do bounds tests on each change of state rather than only at the initial checking.  Concurrency is also easier if you adopt a copy-on-write model (instantiating new objects or other data structures when they are modified).  This avoids worrying about locks when sharing access because writes do not block reads.

III A Functional Key Value Store in Perl


In a past post, we looked at a simple key-value store for Perl.  Now it is time to flesh this out and look at immutability in order to make it more functional.  This new version has a number of important tradeoffs.  While it uses more memory than the previous version, it is far more concurrency safe because we copy on write, and it is easier to make guarantees since everything is immutable.

package MyKVS;
use overload 
    '+' => 'add';
    '-'  => 'remove';

sub new {
     my ($class, $initial) = @_;
     $initial = {} unless ref $initial =~ /HASH/ and not keys %$initial;
     bless $initial, $class;
     return $initial;
};

sub add {
    my ($self, $other) = @_;
    my $new = {};
    $new->{$_} = $self->{$_} for keys %$self;
    $new->{$_} = $other->{$_} for keys %$other;
    return $self->new($new);
}

sub remove {
    my ($self, @other) = @_;
    my $new = {};
    for my $k (keys %$self){
        $new->{$k} = $self->{$k} unless grep {$_ eq $k}, @other;
    }
    return $self->new($new);
}

sub get_value {
    my ($self, $key) = @_;
    return $self->{$key};
}

You will then note that the key value store creates a new key value store on each change.  This means that if we go through the public API, the key value store is mutable for its lifetime.  The code above has not been tested, is probably not the best Perl code in the world, etc. but illustrates the concepts of copy on write fairly well.  Nothing above requires any locking under any circumstances.

We can then use it like:

use MyKVS;
use 5.010;

my $kvs = MyKVS->new({foo => 'bar', 1 => 1});

$kvs += {2 => 'foobar', cat => 'dog'};
my $kvs2 = $kvs + {dog => 'lion'};
$kvs2 -= 2;

say $kvs->get_value('cat');
say $kvs2->get_value('dog');

This will print:
dog
lion

The keys in $kvs are now foo, 1, 2, and cat.  Those in kvs2 are foo, 1, cat, and dog.

IV:  Mutability Challenges


One of the major challenges here is that state ends up having to change over time, and often has to be shared.  Functional programming offers a number of solutions to this problem but all involve essentially handing out immutable structures and copying those when changed.  These principles are not entirely at odds with having some sort of shared state.

This has a number of possibilities, which are beyond the scope of this piece, but this notion of copying on write is fundamental to understanding MVCC on PostgreSQL.

So back to the Key Value Store example, suppose we want to add an ability to share key value stores across an entire application.  We may add then two more methods, optionally in a closure, along with a lexically scoped kvs for the module or methods.

Now, since we copy on write, the only place we need to lock is on the update.  Since our locking strategy will depend on our approach.  In this module the store is static to the module and not shared.  This means we do not need to lock.  If we did, however, we would only lock the update function.  No other functions would need a lock.  In such a way, writers may block writers, but writers never block readers (or vice versa).

{    # start closure

my $kvs = __PACKAGE__->new({});

sub get_store {
     return $kvs;
}

sub update {
     my($self, $update) = @_;
     $kvs += $update;
}

}    # end closure 

Now with these methods something magical happens.  When I update, the local reference hashref gets updated, but everyone getting a hashref gets a private copy, which is effectively immutable (unless one breaks encapsulation). Moreover if one wants to update the global store and get a new updated store, it is as simple as:

$kvs = $kvs->update({key => 'value'});

And now we have a crude multi-version concurrency control for our key value store.  If we wanted to take this further we would store the kvs in shared memory and provide appropriate locking.  Note that locks would only be required on write and retrieving the master copy.

V:  How MVCC Works in PostgreSQL


MVCC is a major feature of modern relational database management systems and it refers to the ability to maintain consistency between requests, handing out older versions of data when necessary to show a consistent snapshot at a specific point in time.  That specific point in time differs based on isolation levels.

MVCC works on the same basic principles we have been discussing here.  Data is copied on write (so there is no difference between an update and delete followed by a select, except that integrity checking doesn't happen on the delete portion).  The database keeps the required bookkeeping information needed to give proper views of the information and then delivers it on request.  Thus inserts are not visible until commit, deletes merely mark data as obsolete, and updates do both.  This avoids a large number of concurrency issues by treating data as immutable and subject to garbage collection.

This basic idea is implemented differently in other database systems of course.  Some physically move rows in anticipation of commit or rollback (using a rollback segment), but PostgreSQL simply stores the rows in the tables, whether obsolete or not, and later collects those via autovacuum (or periodic vacuuming).

The same principle in all cases though is a logical copy of the data being made when the application reads it and another being made when the data writes.  The transaction isolation level affects when this copy happens and which copy the application sees, but not the basic process.

Next:  Basic Foundational Data Types: Sets, Tuples, and Bags