Saturday, November 12, 2011

The Case for Intelligent Databases

LedgerSMB, as an ERP and accounting system, has over time begun to move towards an intelligent database design based on PostgreSQL.  We believe that this best matches our long-term goals for the project, especially because interoperability is a very important requirement for any ERP system.  Read more to see our general case in this area and see what other projects may be able to learn from our experience.

A Short Developer's Definition of Intelligent Databases

An intelligent database is a database encapsulated behind a defined API.

A Longer Description

While most programmers primarily use the database  primarily as a place to simply store data, this reduces a database to basically an ACID-compliant data store with some reporting capability.  This approach saves the programmer from worrying about advanced database functionality, but it also prevents the database system from being used to do a number of useful tasks that databases can really be leveraged to do.

This post will explore why LedgerSMB is going this direction, and the sorts of functionality (particularly regarding PostgreSQL) that is a part of this plan, what sorts of tradeoffs we see in it, and where we expect to benefit.  This is not an article that discusses the details of implementation (see previous posts for that), but rather at a high-level how and why.


The Toolbox

The fundamental idea of encapsulating a database inside a defined API is that usually the application should hit the API and not the underlying relations where the data is stored if at all possible.,  The  API acts as an interface between actual storage/retrieval and the application.

The major aspects of an API can include views and/or stored procedures.  We use stored procedures for most things but may include views for some sorts of objects as time goes on.

What do Intelligent Databases Cost?

Many current development environments seek to save the programmer from spending heavy engineering time on a database, particularly in the early stages of a project where requirements can change.  The database is relegated to a simple data store with ACID compliance.  Frameworks like Rails will even generate a database schema from programming data structures.

If one goes with an intelligent design for the database, one cannot use agile programming methodologies to design the database.  Instead the database requires some real engineering in its own right. This takes some time and effort in its own right and can slow down development compared to ORM-based code generators such as we see in Rails.

Such engineering requires taking a clear look at what data is expected to be collected, what internal structure it has, designing the database storage first and foremost around that analysis,and then building an API based on how it will be used.  For simple schema development this doesn't take a long time but the complexity does grow with the data collected.

In the LedgerSMB project this is not the end of the world since we are re-engineering the database incrementally and so each major branch includes at least a few areas where old db structures have been pulled out and new, better engineered ones added.  These areas then have an API and the application generally hits the API not the relations.  Old code areas may hit the API or the relations.

Other projects may find that this adds significant development time beyond what is reasonable, but these must still be weighed against the benefits below.


What benefits do Intelligent Databases Provide?

Intelligent databases offer a number of specific benefits that LedgerSMB tries to capitalize on.  The fact that these designs better match database theory is not enough.  tangible benefits were required for us to reach this decision.

Some of these benefits speed up time troubleshooting and debugging code, partly offsetting the additional investment mentioned above.  Others provide compelling features to third party developers who may want to integrate our software with theirs.

The single most important benefit is that of clarity.  First, with very few exceptions, LedgerSMB tries to keep to one language per file.  This means that database queries are in .sql files, and Perl is in .pl and .pm files.  The modules which do query mapping are the only Perl modules that contain both Perl and SQL.  In this way if I want to know what a query does, I can go to the relevant SQL file and read it.  Using features like named inputs in PL/PGSQL, even 100 line queries can be remarkably readable.  The API is also well defined, allowing us to troubleshoot issues with the query writer.

Clarity also means that performance problems can more easily be reviewed and addressed without getting into messy questions of ORMs and how they generate queries.  The queries that produce a specified result set from an API call can be defined and the actual function written out in SQL.

Along with clarity, we get the ability to do as much as possible with declarative constraints.  This provides, we believe, greater robustness in the application and therefore better software in the long run.

The second important benefit is in the area of integration.  Because the API is defined, there are specified integration points that other applications can use to integrate without worrying as much about ensuring that corner cases are properly covered.  This makes integration more robust.  Moreover important areas like LISTEN and NOTIFY allow other applications to integrate in ways approximating message queues, and the like.

In fact, the integration benefit was the driving force behind the decision to move to an intelligent database design.  We wanted to open up the application to add-ons written in other languages (Python at the time, but others could supported without too much effort).

Addressing Separation of Concerns

We have all heard horror stories of what happens when you put everything in the database.  We have also heard of developers afraid to put anything in the database because of fear of going down the roads of these other horror stories.  The secret to success here is in fact to separate concerns and to separate them well.  Many lines are ultimately arbitrary in detail but bright in outlines and so they require a team dedicated to making decisions carefully rather than drawing bold lines and mindlessly following them.

Of course that's a long way of saying you can't get away with sloppy engineering and have to actually put some thought into things.


Other Options

There are other options for application development. However, they all presume that the application has intimate knowledge of the internals of the database.  I am somewhat puzzled why programmers who are so careful to avoid modules understanding the internals of eachother are so careless about encapsulation in the database.

The main other options are use of an ORM, which typically involves designing the database around the ORM and the data model for the application, and thus the RDBMS becomes basically an extension of the object model.  This limits interoperability, and reduces the role that the database has relative to interoperability with other applications.  This being said there isn't necessarily a problem with using VIEWs and ORMs together to provide this sort of abstraction.

A second possibility is to use a dedicated store, like many NoSQL engines offer where the database really is a store for objects, and where the "schema" (in scare quotes because NoSQL usually doesn't really have a schema concept) is just a direct extension of the object model itself.

The tradeoff with NoSQL is similar to using an ORM and building your database around that ORM, but more extreme.  Typically you see even more rapid development early on, combined possibly with additional scalability in some cases, but with the tradeoff of interoperability is more severe, and ad hoc reporting is generally not practical outside the relational model.

Because of our desire to focus on interoperability and a reliance on good, flexible reporting, these options were not options for our project.  They might be for others though.

Conclusions

LedgerSMB has benefited greatly benefited from the decision to move towards an intelligent database layout.  Reasons include our focus on interoperability, and our willingness to do engineering up front.  Other projects may find their mileage varying.

4 comments:

  1. I'm the chief designer of a Saas ERP system in Italy. Our design decision has been to develop an internal framework in Java, and in 2001, when we started, we developed our own ORM.
    Today I can say that we have a wonderful development environment but I find the ORM sometime a performance limiting and complexity factor. ORMs don't necessarily make it difficult to do a clean relational design. We decided to design an ORM around the database design and not around an OO design, that's why we don't use Hibernate or something like it. But this made our OO design less than perfect. It's a trade-off.
    I've been thinking about potential solutions for this problem and one may be your approach. Personally I find the following limitations in it:
    - portability across RDBMSs. In our case portability has been a winning factor in one big project where we were forced to use MS SQL Server (...) instead of PostgreSQL;
    - SQL is not a "db schema refactoring" friendly language, and in general is not a developer friendly language. Just think about the case of changing a primary key in a table (we don't use the "every PK is surrogate" pattern). All the joins in all SQL involving the table must be changed just for starters. Our framework today allows a change like this to happen nearly transparently, with full compiler checks. Normally SQL can't be checked statically. You may have a big test pile to cope with this problems, anyway;
    - I also find very difficult to write SQL in a way I don't have to "repeat myself" every time I need to embed a piece of knowledge in a statement, like putting a derived value expression in many queries. I don't think views and functions can solve this kind of problems easily (that is without compromising the work done by the optimizer);
    In the end what I'm thinking about is a DSL for database development that is at a higher level of abstraction than SQL but is translated at runtime into SQL. This means, instead of materializing the data model as programming language classes like we currently do with ORMs, and instead of using the database directly, like you do, we may have a middle tier intelligent interpreter that dynamically translates application requests into SQL (or stored procedures) using a sort of knowledge base about the model that is expressed mostly in a declarative way. Our framework today does a lot of this, but classes implementing the model are a bad design, in my experience, and I'm looking forward to get rid of them.
    Maybe I will be able to start this project in a near future.
    And yes, NoSQL may be another more Cloud friendly avenue, but needs taking a much bigger risk for an ERP writer. Maybe a Saas solution like ours may benefit and I'm thinking about this, but is a complete rewrite and needs a complete architectural rethinking of an ERP system.

    ReplyDelete
  2. Thanks for your comment.

    This is one thing I think is worth emphasizing:
    "We decided to design an ORM around the database design and not around an OO design, that's why we don't use Hibernate or something like it."

    One thing I'd ask is whether one might still be better off with some level of encapsulation. For example using VIEWs to mediate betweeen the ORM and a clean relational design.

    "- portability across RDBMSs. In our case portability has been a winning factor in one big project where we were forced to use MS SQL Server (...) instead of PostgreSQL"

    That's a significant tradeoff with a procedural interface. Procedural interfaces are not particularly portable, though I suppose porting between PostgreSQL and Oracle might not be so bad. Certainly going to MS SQL is out of the question.

    "- SQL is not a "db schema refactoring" friendly language, and in general is not a developer friendly language. Just think about the case of changing a primary key in a table (we don't use the "every PK is surrogate" pattern). All the joins in all SQL involving the table must be changed just for starters. Our framework today allows a change like this to happen nearly transparently, with full compiler checks. Normally SQL can't be checked statically. You may have a big test pile to cope with this problems, anyway;"

    Here's what we have done in LedgerSMB about this issue:
    1) Every table has a primary key which is a natural key.
    2) Every table has a secondary, surrogate key used for joins.

    "I also find very difficult to write SQL in a way I don't have to "repeat myself" every time I need to embed a piece of knowledge in a statement, like putting a derived value expression in many queries."

    User defined functions can be used to do value derivation, and you don't lose any planner optimization when one doesn't have to hit a table.

    For example, in our fixed asset depreciation workflow, we have a function which calculates, on a straight-line basis, the amount to depreciate based on inputs, and we call this in the field list of the function. This way we can have several different straight line methods which vary in how time units are handled, but which share basic calculation code. This has no optimizer penalty because the function is immutable (depending solely on inputs) and does not hit underlying tables, so opacity is not a concern.

    Views can sometimes (but not always) solve some of this because views, unlike functions, are transparent to the planner at least in PostgreSQL, so the planner issues which get introduced end up being due to the output of the rewritten query with the view used instead of opacity issues. In other words, at least in PostgreSQL, the performance issues come from the fact that views are developer-opaque, as they are not planner-opaque.

    The above solutions don't cover all of the cases where one might wish to centralize code for maintainability reasons, but they help significantly.

    "And yes, NoSQL may be another more Cloud friendly avenue, but needs taking a much bigger risk for an ERP writer. Maybe a Saas solution like ours may benefit and I'm thinking about this, but is a complete rewrite and needs a complete architectural rethinking of an ERP system."

    I don't think NoSQL is viable in the ERP space because it is extremely inflexible regarding reporting. Where NoSQL really shines is where you have single-purpose applications which have their own integration API and extremely limited reporting needs. The example that comes to my mind would be an LDAP directory. While ERP applications could meet the first prong, I do not think they could ever meet the second. Indeed, while NoSQL systems may have a place in an ERP environment, I believe those would be entirely secondary to the RDBMS (i.e. perhaps data would be exported to such secondary systems).

    ReplyDelete
  3. I think this article is so wrong on so many levels. I tried to post my comments here, but your blog software rejected them for having too many characters. Instead I have posted it on my own website at http://www.tonymarston.co.uk/php-mysql/the-case-against-intelligent-databases.html

    ReplyDelete
  4. Hi Tony:

    your points deserve some more discussion than I can give in a comment. I will probably post a response. For some applications, you may be right, but when one gets into complex line of business applications where data must be re-used, I think a lot of that breaks down. More soon.

    ReplyDelete