Saturday, September 8, 2012

O/R modelling interlude: PostgreSQL vs MySQL

Every time people compare PostgreSQL and MySQL on the internet, the discussion descends into a flame war fast.  I think that a large part of the problem is that advocates of these databases look at the other database (and advocacy of the other database) through a specific lens and are therefore unable to understand the rhetoric from the other side.  This is an attempt to cut through some of this, offered in the spirit of suggesting that if we can't understand each other's views, then instead of discussion all we will get are flames.

This post is not intended to be a post about why MySQL sucks (although I will point out some cases where it is inadequate and other cases where it reduces transition costs and times) or about why you should choose PostgreSQL instead.  I am of course biased, having worked with MySQL for some time but switching to PostgreSQL for important work back in 1999.  I personally don't much like MySQL, and so it is worth stating my bias up front.  However, I don't think that prevents me from trying to place it constructively in the marketplace or express a sympathetic understanding of what MySQL has to offer developers.

The endless arguments I think are the result of very specific pressures on the RDBMS market, and PostgreSQL in many ways targets what the system is in theory and what it can be, while MySQL targets how an RDBMS is more typically used when developing software for sale.  MySQL is disruptive only by virtue of being open source.  PostgreSQL is disruptive by design and the principles pioneered on Postgres have found their way into Informix, DB2, Oracle, and more.

A simple description of the difference (and this is not intended as a flame) is:

MySQL is what you get when application developers build an RDBMS.
PostgreSQL is what you get when database developers build an application development platform.

The above is not intended as a flame on either side by any means but it does show where people run into pain on both sides (server-side application-style development on PostgreSQL doesn't work, and trying to use MySQL the way you would use Oracle for data centralization really doesn't work either).

In case people are wondering, I intend in the future to look at object-relational modelling potentials in DB2 and Oracle in contrast to PostgreSQL as well.  These products, however, are more similar to PostgreSQL than MySQL is, both in terms of use case and market position.

App-Centric vs Data-Centric

The whole purpose of a database management system is to store data so that it can be re-used.  This is true regardless of whether you are using a NoSQL solution, a light-weight quasi-RDBMS like sqlite, or a heavy duty system like Oracle.    The type and scope of that re-use varies quite a bit between products though.  Larger-scale RDBMS's typically focus on flexible output and rigid, validated input, and this is more important as more applications start to write to the database.  Smaller databases and NoSQL databases tend to place the application in the driver's seat and do less validation.

These approaches exist on a continuum, of course, but in general you need more type checking the more applications may be writing to the database.  With loose checking, the assumption can be made that the database is primarily a store of private state information, and therefore idiosyncrasies of the application don't matter that much.  But when you have multiple applications writing to the same relations, the data is "public" in a way that it is not otherwise, and therefore there is a lot of value in ensuring that application idiosyncrasies do not cause misbehavior in other software or cause data to be misinterpreted.

So on one side we have big, shared data solutions like Oracle and DB2.  On the other, we have applications which are fundamentally designed to be a data store for a single application, where the developer can be given total freedom regarding data validation.  Many of the NoSQL solutions fit here.  Some SQL-like solutions are also on this side, such as SQLite.

We can look at how MySQL and PostgreSQL fit on this continuum and try to get a feel for why the different points of view seem to be so difficult to bridge.  We will look particularly at solutions given to past data integrity concerns and what those mean for information management and application development.  In PostgreSQL we will look at user-defined function enhancements and replication.  In MySQL we will look at strict mode.

These solutions show radically different views of how the software is expected to be used.  PostgreSQL may be characterized as conservative but innovative, and unwilling to do anything that might prejudice data in multi-application environments.  MySQL, on the other hand, may be characterized as focusing on the needs of the app developer, sometimes to the exclusion of the needs of the DBA.

MySQL on the app vs data spectrum:  SQL Mode Salad

MySQL 4.x and earlier was notoriously loose with data constraints.  Zero dates were seen as valid, as was a date like '2008-02-30'.  Data would be truncated to fit fields, or otherwise transformed.  These transformations were sometimes lossy but were predictable, and in fact make some sense in a content management environment (my first use of any RDBMS was MySQL for light-weight content management in this timeframe).  Typically the data stored was not terribly important (unlike accounting systems or the like), and so it was good enough for its use case.  It may be extreme to call data truncation a feature, but given MySQL's initial use case, that would not be entirely inaccurate.

The big problem though was that people were trying to build applications to do other things beyond the initial use case, and in many of these cases database transactions were needed and better type checking was needed.

To address these problems, MySQL leveraged its pluggable storage engine system to allow third-party vendors to create open source or dual-licensed (with a license from MySQL) transactional table types.  InnoDB, BDB, and a few other engines arose in this way.  Of course, if transactions are handled at the table level, and tables are handled by plugins, then data definition language statements can never be transactional.  This isn't the end of the world for many MySQL deployments (for reasons stated below), but it also doesn't handle the type checking issues, which have to be handled before storage.
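The consequence is easy to see in a session.  Here is a hypothetical MySQL session (the table names are made up, but implicit commit on DDL is documented MySQL behavior):

-- DDL in MySQL implicitly commits the open transaction,
-- so schema changes cannot be rolled back with the data changes around them.
START TRANSACTION;
INSERT INTO accounts (id, balance) VALUES (1, 100);  -- transactional (InnoDB)
CREATE TABLE audit_log (id INT PRIMARY KEY);         -- implicitly commits the INSERT
ROLLBACK;  -- too late: the INSERT was already committed by the CREATE TABLE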

To handle the type checking issue, MySQL implemented the concept of SQL modes, which allows one to select a number of options which then define a dialect of SQL for use in the software.  In this way one can ease porting from other systems to some extent, and address questions like how strictly types should be checked.

MySQL allows any user to set the SQL mode for the session, and this can have effects ranging from SQL syntax to whether '2008-02-30' is accepted as a valid date.  In essence, in MySQL the application is king and the db a humble servant.
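For example, a single connection can loosen or tighten checking for itself.  This is an illustrative sketch (the events table is hypothetical; mode names and defaults vary across MySQL versions):

-- Permissive mode: the bogus date is coerced or stored, perhaps with a warning.
SET SESSION sql_mode = '';
INSERT INTO events (happened_on) VALUES ('2008-02-30');

-- Strict mode: the same statement is now rejected with an error.
SET SESSION sql_mode = 'STRICT_ALL_TABLES';
INSERT INTO events (happened_on) VALUES ('2008-02-30');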

It is worth noting that the original few modes have been expanded into a very large set, allowing applications to tell MySQL to accept syntactical idiosyncrasies of other RDBMS's.  This sort of thing shortens the time necessary to initially port an application to MySQL, and this is great as long as certain boundaries are maintained.

This sort of assumption only really works in practice where only one application is writing to a given set of relations.  If you have two or ten applications reading and writing the same tables, then each one of them can decide what sort of data assumptions the server should use when validating the data prior to storing it.  MySQL thus trades robustness and guarantees on the output of data for flexibility of input (this is the fundamental tradeoff in NoSQL as well), and this therefore ends up relegating the database to an application's private data store.

The typical response I have received when asking how to manage this is to put an API layer in the application which does the data checking.  In other words, the application, not the database, is expected to be the gatekeeper regarding valid data.

MySQL users think in terms of the "software stack" with MySQL existing just above the operating system.  The sole purpose of MySQL is to store data for the single application owning the relation.  MySQL users generally do not think in terms of data shared between applications using SQL as an API, and the MySQL developers happily provide the software they are looking for.  This software sits somewhere between traditional RDBMS's and NoSQL systems in terms of flexibility of data in vs out.

MySQL does not provide facilities for DBA's to restrict which SQL modes are available.  This more or less prevents MySQL from outgrowing the single application per table use case, and it prevents MySQL from  being a genuine information management solution.  That isn't to say it is a bad development tool.  However it should be seen as an RDBMS-like application back-end, rather than a classical RDBMS (which has generally been geared towards centralized data management).

PostgreSQL on the app vs data spectrum and new features:  NO INHERIT constraints and named arguments in SQL functions

PostgreSQL began life as the Postgres project out of UC Berkeley.  It was initially a research testbed for advanced database concepts, namely those called "object-relational" in terms of data modelling.  The idea is that more complex problems can be modelled when behavior is tied to data structures which can then be relationally manipulated based on that behavior.

The basic promise of object-relational database management is that data structures can be tied to processing routines, so that more advanced models can be built and relationally queried.   This allows more complex applications to be written without overly burdening the database system with large amounts of data transfer or large result sets in memory.  With plain relational processing, we'd have to express what we can relationally and then filter out the excess in our application.  With object-relational modelling we can build more advanced filters into the SQL queries without affecting readability.  In order to do this, PostgreSQL allows for user defined functions to be written in a variety of languages with C, SQL, and PL/PGSQL being available on the default installation.  Other languages can be added using a plugin system (something MySQL has as well, but not well suited there for Object Relational management of data).
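As a small illustration, an additional language handler can be installed and then used from ordinary SQL.  This sketch follows the example in the PostgreSQL documentation and assumes the PL/Perl package is installed on the server (on versions before 9.1, CREATE LANGUAGE would be used instead of CREATE EXTENSION):

CREATE EXTENSION plperl;

CREATE FUNCTION perl_max (integer, integer) RETURNS integer AS $$
    my ($x, $y) = @_;
    return $x > $y ? $x : $y;
$$ LANGUAGE plperl;

-- The function participates in relational queries like any other
-- (price and msrp are hypothetical columns):
SELECT perl_max(price, msrp) FROM products;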

As we have been looking at, PostgreSQL provides powerful capabilities of modelling data which goes well beyond what data is stored and includes the ability to model data derived from what is stored.  PostgreSQL in fact pushes the envelope in this area even beyond where Informix, DB2, and Oracle have taken it.  It is thus a platform for building highly intelligent models of data.


All current major object-relational database implementations (including those of DB2 and Oracle) are at least inspired by Michael Stonebraker's work in this area, and by Postgres in particular.  Indeed, Informix's object-relational capabilities started life as a Postgres fork named Illustra, and both DB2 and Oracle more or less implement the O-R bindings to SQL developed on that platform.  PostgreSQL is different SQL-wise in part because Informix split prior to Postgres adopting SQL as a query language. 

Centralizing data modelling code in the database is generally a net win.  Result sets from SQL queries tend to be smaller, there are fewer round trips between the application and the database, and the models themselves can be used to create a public API, similar to the way an ORM might be used, thus allowing centralization of key code across all applications that may read or write to a specific database.  This doesn't mean "all logic belongs in the database."  Rather it provides for the ability to build more advanced (and efficient) data models which can place processing where it is most efficient and avoid the worst of the scalability bottlenecks.

PostgreSQL has always had a data modelling focus rather than the application back-end focus seen in MySQL.  When we look at two recent additions we can see that legacy as well.   In this model, guarantees regarding data are paramount, and applications are expected to use the relations as, essentially, public API's.  Consequently the database, not the application, is responsible for data consistency and semantic clarity.  Foreign keys, for example, are never ignored by PostgreSQL (in MySQL, handling is dependent on table type), and the 30th of February is never a valid date no matter how much the application would like to use it as such.  Dates are always in the Gregorian calendar, where this is not valid.  If you care about handling Gregorian-to-Julian conversions, you will have to do this for your locale, perhaps as a custom type.
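The contrast is easy to demonstrate.  On a stock PostgreSQL installation, in any session, with any settings (error message paraphrased):

SELECT '2008-02-30'::date;
-- ERROR:  date/time field value out of range: "2008-02-30"

There is no mode an application can set to make this succeed; the guarantee holds for every writer to the database.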

PostgreSQL 9.2 will add the ability to use named parameters in the bodies of SQL-language functions.  This is a huge win for us with the LedgerSMB project and makes our lives a lot easier.  The change is backwards-compatible, so our existing functions will just work, but it allows for clearer SQL code in function bodies.  SQL-language functions have been a part of PostgreSQL for some time and have steadily become more useful and powerful; for example, they can sometimes be in-lined and treated as subqueries, which has important performance benefits.  In terms of object-relational modelling this is a big deal.  In our specific case it is a larger issue still, because we use these functions to define application API's within the software.

The addition of named arguments in bodies means one can do functions like:

CREATE OR REPLACE FUNCTION asset_item__search
(in_id int, in_tag text, in_description text, in_department_id int, in_location_id int)
RETURNS SETOF asset_item AS
$$
 SELECT * FROM asset_item
  WHERE (id = in_id OR in_id IS NULL) OR
        (tag like in_tag || '%' OR in_tag IS NULL) OR
        (description @@ plainto_tsquery(in_description)
              OR in_description IS NULL) OR
        (department_id = in_department_id
              OR in_department_id IS NULL) OR
        (location_id = in_location_id OR in_location_id IS NULL);
$$ LANGUAGE SQL;

Instead of

CREATE OR REPLACE FUNCTION asset_item__search
(in_id int, in_tag text, in_description text, in_department_id int, in_location_id int)
RETURNS SETOF asset_item AS
$$
 SELECT * FROM asset_item
  WHERE (id = $1 OR $1 IS NULL) OR
        (tag like $2 || '%' OR $2 IS NULL) OR
        (description @@ plainto_tsquery($3) OR $3 IS NULL) OR
        (department_id = $4 OR $4 IS NULL) OR
        (location_id = $5 OR $5 IS NULL);
$$ LANGUAGE SQL;

While we are slowly converting our code base to use ORDBMS features, this will help keep things more maintainable.  Of course a more O/R procedure might be:

CREATE OR REPLACE function similar_to(self asset_item) 
RETURNS SETOF asset_item
LANGUAGE SQL AS $$

 SELECT * FROM asset_item
  WHERE (id = self.id OR self.id IS NULL) OR
        (tag like self.tag || '%' OR self.tag IS NULL) OR
        (description @@ plainto_tsquery(self.description) 
              OR self.description IS NULL) OR
        (department_id = self.department_id  
              OR self.department_id IS NULL) OR
        (location_id = self.location_id 
              OR self.location_id IS NULL);
$$;

You can see the difference in clarity.  In fact, in LedgerSMB, we often find ourselves using pl/pgsql solely due to the ability to use named parameters in the queries inside function bodies.  We find this is more robust for what we do, and easier to read as well.
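Such a function is then called by handing it a row of the type, for instance one constructed with only the attributes of interest filled in.  A hypothetical invocation (the column order of asset_item is assumed here; NULL attributes mean "match anything"):

-- Find assets similar to a partially-specified item.
SELECT *
  FROM similar_to(ROW(NULL, 'SRV-001', NULL, NULL, NULL)::asset_item);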

User defined functions have been with PostgreSQL from the very beginning, and in fact are required for doing any significant object-relational modelling.  However they are still evolving because this is where PostgreSQL's focus has always been.  We have seen major improvements here in every major release and the function-handling capabilities of PostgreSQL are among the best of any database I have ever worked with.  With third-party handlers it is possible to use all kinds of languages for functions (including Perl, Python, R, and Java) and these can be incorporated in standard SQL queries.  PostgreSQL's focus is and has always been on doing advanced data modelling where the application is best seen as a consumer of managed data models.

From the perspective of the original Postgres developers and the primary developers today, PostgreSQL is a solution for managing and modelling data, where many applications may write to the same relations, and where the relations function, essentially, as an API.  For this reason, PostgreSQL tends to be far more strict about what it will allow applications to do than engines built on the single-app use case.  SQL, in effect, evokes public API's and it is the job of the RDBMS in such an environment to ensure that the API's behave consistently.

This approach blends the traditional roles of database and middleware because a great deal of business logic can be reduced to questions of the data model itself, what questions are asked of the database and what sorts of responses there are.

The other feature to look at is the addition of non-inherited CHECK constraints in 9.2.  In past versions, as we have covered, all check constraints were inherited by all child tables.  This allows one to use table inheritance safely without some of the key issues that have plagued it before, although I still consider it an antipattern to insert into both parent and child tables.
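A minimal sketch of the 9.2 syntax (the table and constraint names here are hypothetical):

CREATE TABLE note (
    id int,
    subject text,
    body text,
    CONSTRAINT note_subject_set CHECK (subject IS NOT NULL) NO INHERIT
);

CREATE TABLE invoice_note (invoice_id int) INHERITS (note);

-- invoice_note does not inherit note_subject_set, so this succeeds
-- even though it would be rejected on the parent table:
INSERT INTO invoice_note (id, invoice_id) VALUES (1, 10);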

MySQL and PostgreSQL offer different sorts of flexibility.  MySQL offers tremendous flexibility with what sorts of constraints an application wants enforced at the session level, and what sorts of guarantees a type of table can offer.  These are very useful for some problems.

PostgreSQL offers neither of these, but instead offers flexibility in building advanced data models.

MySQL is designed with the idea that applications provide logic and the database provides dumb storage of the application's state.  While this has changed a bit with the addition of user-defined functions and stored procedures, the overall design constrains MySQL primarily to this use case.  This is not necessarily a bad thing as, traditionally, software licensing costs and requirements have often required that even advanced database systems like Oracle are used in this way.  MySQL targets the "my app, my database" world and is usually sufficient for this, particularly when lowest common denominators are used to ensure portability.

PostgreSQL, on the other hand, is designed with the idea that the database itself is a modelling tool, and that the applications interact with it over an API defined in SQL.  Object-relational modelling advocates point out that often getting acceptable performance in complex situations requires an ability to put some forms of logic in the database and even tie this to data structures in the database.  In this model, the database itself is a development platform which exposes API's, and multiple applications may read or write data via these API's.  It is thus best seen as an advanced data modelling, storage, and centralization solution rather than as a simple application back-end.

These differences show, I think, that when PostgreSQL people complain that MySQL is not a "real database management system," and MySQL people dispute this, the real difference is one of definitions, and in this case the definitions are deceptively far apart.   Understanding those differences is, I think, the key to making an informed choice.

17 comments:

  1. "Indeed, Informix started life as a Postgres fork"

    I'm afraid your RDBMS industry history is somewhat off. Informix was founded in 1980 (as Relational Database Systems, Inc.). The Berkeley POSTGRES project hadn't even started then because at that time Stonebraker was busy incorporating Relational Technology Inc. to commercialize the INGRES project. The first POSTGRES project paper (The Design of POSTGRES) was published in Nov. 1985. Illustra, which commercialized POSTGRES, was founded in 1992 and it was acquired (the "fork"?) by Informix in 1996, but by then Informix had been marketing its own RDBMS for 16 years.

    1. Fixed it. According to what I have been able to find, it looks like the approach that Illustra was taking to Object-Relational database management was more or less merged in and Informix to this date still supports jagged rows when inheritance is used (I think it is the only ORDBMS to do so other than Illustra which they purchased).

      Again, many thanks.

  2. Great Article. Enjoyed reading it.
    I am an Oracle DBA. but now i started learning that Postgres is very cool replacement for most of the cases.

  3. Awesome article, written constructively instead of fanboy-wars.

  4. I would be interested to hear any versus type argument with MarkLogic http://www.marklogic.com

    1. I haven't worked with that system. Sorry.

  5. Given the above, you fundamentally point out the ownership of the data and accompanying rules are commonly best managed by a strict central point, particularly in multiple app environments. Given a lot of those environments are web facing and can rely on multiple parties working together in the life cycle of a task, do you still take this stance?
    Or do you push to rules engine services and other higher level Apis, services, etc to absorb those responsibilites, and adopt the db-is-dumb-object-repo view

    1. Daniel: "Best" is a bit of a problematic word here. The simple thing is it depends on what you are doing. Let's take two scenarios here:

      1) Application A's behavior relates to automating object behavior in application B. In this case RESTful web services won't be a good match and neither will db-level automation. You really need something more intimate.

      2) Application A utilizes resources from application B. In this case you could go with DB-level interfaces or RESTful web interfaces. The decision regarding which to go on will depend on the nature of the apps. For example, a mobile timecard app hitting your ERP application across the internet will be different then a cash register hitting the same ERP system. The former is probably cleaner using web services, while the latter is much cleaner using a direct db-level API.

      tl; dr: Layers of API's are best.

  6. I like hearing constructive use case analysis like this. Great insight. Thanks for sharing.

  7. This article was immensely valuable to my education. Thank you.

  8. Thank you for the well-constructed insight :)

  9. Excelent post! thank you very much. The article make me think all PG needs is more marketing, like MySQL ;)

  10. I think you've really nailed it with regards to how postgreSQL and mysql are so different -- I'm looking forward to your next posts regarding the use of object-relational usage. I've commented on this blog entry here: http://www.dbspecialists.com/blog/uncategorized/postgresql-mysql-and-oracle/

  11. Great article! This was very insightful, and really helps me design my applications better. I grew up with PHP+MySQL, so I was firmly in the "databases are libraries that help an application" camp. I never considered the whole "public API via SQL" attitude!

    One question here, for those in the know: I'm currently using Microsoft tech quite much, and I'm wondering where SQL Server is on this axis. I have the underbelly feeling that, like many MS products, it tries its best to satisfy both camps and therefore ends up a little in the middle. Can SQL Server be used as effectively as PostgreSQL for the "the DBMS is a public API that talks SQL" approach? If not, why not? Strongly opinionated answers are very welcome :-)

    1. Two things are needed to be able to use SQL as a public API: 1) Guarantees that other applications' data will be properly type checked and subject to appropriate domain controls and 2) extensible interface that allows for logic to be plugged into the db level to enforce arbitrary constraints. MySQL can do #2, and is missing only a small component to #1 (it can do the type checking as it stands but cannot offer the guarantee).

      SQL Server offers both. It is very different extensibility-wise than PostgreSQL but it is extensible in its own way and so if you know how to work with it, it can certainly be used to encapsulate access to data for multiple applications to work with.

    2. MSSQL is a very capable product. If you have an experieced Ops team that excels at managing Windows servers, you will be fine. It has very good management and scalability tools, With the advent of managed code in UDF's and sprocs it's a very developer-friendly platform.

      I seldom use it anymore, but I developed line-of-business software against it for 5 years without issue.
