Monday, August 29, 2011

Smallworld Technical Paper No. 8 - GIS Databases are Different

by Peter Batty and Richard G. Newell

Synopsis

There has been much debate in the GIS industry about the suitability of standard commercial database management systems (DBMS's) for use in GIS. Historically most GIS's used proprietary storage mechanisms for their geographical data, as the performance of commercial DBMS's did not really make them a viable proposition. However, in the last few years, advances in hardware and software performance have made it possible to develop GIS products on top of commercial DBMS's. This approach has several obvious attractions.

However, the main argument of this paper is that current commercial DBMS technology has fundamental restrictions for GIS, and that a radically different client-server architecture based on version management has major advantages over traditional DBMS architectures. This new architecture is described and its benefits are explained. Clearly integration with existing databases and conformance to existing standards are also very important, and these issues are also discussed.

The Use of Standard DBMS's for GIS

The attractions of using standard DBMS's for GIS have been described in some detail by various people, including one of the authors of this paper (see Batty (1), Seaborn (2)). In summary, the advantages of this approach are that the GIS vendor should be able to take advantage of functions provided by the DBMS vendor and concentrate on developing GIS functions.

In particular, functions such as security and backup and recovery are well proven in standard DBMS's and the GIS can take advantage of these. The GIS user can exploit existing database skills and use common procedures for many database administration tasks, for both GIS and non-GIS data. Integration between GIS and non-GIS data should be easier to achieve when using this approach. There is increasingly good capability available for integration of multiple DBMS's from different vendors in a heterogeneous distributed environment.

The drawbacks of the standard DBMS approach are perhaps less obvious, with the exception of performance. There is an argument that performance will not be an issue in the longer term because of advances in technology. Certainly performance has improved, and it is now possible to implement a reasonably efficient spatial indexing system on top of a standard DBMS, especially one which supports physical clustering of data. However, the performance issue is more complex than this, and we will discuss later how a client-server version managed architecture can offer an order of magnitude better performance than conventional systems in a networked GIS with a large number of users.

The biggest single drawback of standard DBMS's is their lack of capability in the area of handling long transactions and version management, which is discussed in the next section. Handling long transactions is a fundamentally different problem from that of handling short transactions, which is what standard DBMS's are designed for, and handling the former would require significant re-architecting of existing DBMS products.

Some of the other apparent advantages of standard DBMS's are not as clear cut as they might appear. While the GIS vendor may not have to worry about writing code to cater for some DBMS tasks such as backup and recovery, this advantage may be outweighed by the amount of code which has to be written to implement other functions which are not provided by the standard DBMS, such as the storage and retrieval of spatial data, and the provision of some sort of long transaction handling mechanism such as checkout.

The integration of geographic data into external applications is also not necessarily simplified by storing it in a standard DBMS. Because spatial data types are not explicitly supported by most standard DBMS's, the GIS vendor typically has to use complex record structures and retrieval algorithms for the geographic data, which means that it cannot be easily read or updated by external applications. The external application also needs to understand any special locking mechanisms which are used by the GIS to handle long transactions, since the standard DBMS locking does not handle this. The most obvious short-term solution to this problem is for the GIS vendor to provide an API (Application Programming Interface) which can be used by external applications, but that approach can be used equally well whether or not a standard DBMS is used for the GIS data. This is an area where progress is being made by the standard DBMS vendors, as they begin to provide the ability to define new data types and operators within SQL. Capabilities are included in the draft SQL3 standard which will be useful in this respect.

Long Transactions

A database transaction is a group of related updates against a database which form a single "unit of work". Standard DBMS's are designed to handle "short transactions", so called because they usually take a short time - typically a few seconds at most. A simple example of a short transaction in a banking system would be transferring some money from a current account to a savings account. The system first has to subtract the appropriate amount from an entry in the current account table, and then add the same amount to an entry in the savings account table. If the system fails halfway through the transaction then the database is in an inconsistent state, and it must be rolled back to the state at the start of the transaction to make sure that the database is still consistent. An inconsistent state could also arise if someone else were able to update the records involved while the transaction was in progress, so these records will be locked and nobody else will be able to update them for the duration of the transaction. Standard DBMS's are designed to handle this sort of transaction very robustly, even in the event of system failures.
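
As a minimal illustration of a short transaction (a Python sketch using the standard sqlite3 module purely for illustration, rather than any particular commercial DBMS), the bank transfer might look like this:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE account (name TEXT PRIMARY KEY, balance INTEGER);
    INSERT INTO account VALUES ('current', 100), ('savings', 0);
""")

def transfer(conn, amount):
    # A single short transaction: both updates succeed together or not at all.
    try:
        conn.execute("UPDATE account SET balance = balance - ? "
                     "WHERE name = 'current'", (amount,))
        # A failure at this point would leave the database inconsistent...
        conn.execute("UPDATE account SET balance = balance + ? "
                     "WHERE name = 'savings'", (amount,))
        conn.commit()    # ...so the changes only become permanent here,
    except Exception:
        conn.rollback()  # and any failure rolls back to the starting state.
        raise

transfer(conn, 25)
```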

In contrast, a typical GIS transaction is something very different. For example, an engineer designing an extension to a gas utility network will need to digitise a number of pipes, valves and other objects, which may take hours, days or even weeks. While he is in the process of doing this design, the database will not be in a consistent state - if someone else looked at the work when he was halfway through it, it would probably not make engineering sense. So, in some sense, he needs a private copy of the data for the duration of the transaction, which he can make available to other users when he has finished his work. He may also want to create multiple different designs for his network extension and do some analysis on each alternative design before deciding which one to use. Once the work has been carried out, it may be necessary to make some modifications to the database to reflect any changes which were made to the original design when the work was done ("as-built" changes).

This whole process may well take weeks or months, and since separate copies of the data have to be managed, concurrency control must also be addressed. Some mechanism is required to handle the situation where multiple people want to work in the same area at the same time. There are two basic approaches to the concurrency problem: pessimistic or optimistic. The pessimistic approach prevents more than one person working on the same area at the same time, by locking out any area where a user is currently working. The optimistic approach allows multiple users to work in the same area with no constraints, on the assumption that conflicts between their changes are unlikely to occur. Conflicts are checked for at the time that the changes are posted to the master version of the database, and they can be resolved at that time. The optimistic approach is generally the more suitable for GIS, since typically the volume of updates is small in relation to the total amount of data, and working practices often dictate that there should not be conflicts even if two people are working in the same area.
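
In outline, the optimistic check amounts to comparing three versions of the data at posting time. The following Python sketch is illustrative only (hypothetical helper names, with records reduced to simple dictionaries; this is not the mechanism of any particular product):

```python
def detect_conflicts(base, master, edited):
    """Optimistic concurrency: no locks are taken while the user works;
    versions are compared only when changes are posted to the master.

    base   -- snapshot the user started from, as {record_id: value}
    master -- the master version as it stands at posting time
    edited -- the user's working version
    """
    conflicts = []
    for rec_id, new_value in edited.items():
        changed_by_user = new_value != base.get(rec_id)
        changed_in_master = master.get(rec_id) != base.get(rec_id)
        if changed_by_user and changed_in_master:
            conflicts.append(rec_id)  # both sides touched it: resolve by hand
    return conflicts

base   = {1: "pipe", 2: "valve"}
master = {1: "pipe", 2: "valve (resited)"}   # someone else's posted change
edited = {1: "pipe (rerouted)", 2: "valve (moved)"}
assert detect_conflicts(base, master, edited) == [2]
```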

Checkout

The most common approach which has been used to address the long transaction problem is checkout. In this approach, the user specifies an area in which he wishes to work, and the data selected is copied to a separate working area which is used only by that user. The working area may or may not use the same DBMS and data structures as the master database. Updates are made to this working data set, and when the work has been completed the changes are applied to the master database. The concurrency control mechanism used may either be pessimistic, in which case all data retrieved by a user is locked and can only be viewed, not updated, by other users, or it can be optimistic, in which case multiple users can retrieve overlapping areas and conflicts are identified when changes are passed back to the master database. For a more detailed description of a checkout based system, see Batty (3).
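
The essential shape of a checkout workflow is sketched below (hypothetical Python, with the master database and working area reduced to dictionaries and the spatial query reduced to a predicate):

```python
def checkout(master, predicate):
    # The slow step: copy every selected record into a private working area.
    return {rec_id: rec for rec_id, rec in master.items() if predicate(rec)}

def post_back(master, base, working):
    # When the long transaction completes, apply to the master only those
    # records which the user actually changed.
    for rec_id, rec in working.items():
        if rec != base.get(rec_id):
            master[rec_id] = rec

master = {1: "pipe A", 2: "valve B", 3: "pipe C"}
in_area = lambda rec: "pipe" in rec        # stand-in for a spatial query

base = checkout(master, in_area)           # retrieval, kept for comparison
working = dict(base)                       # the user's private working copy
working[1] = "pipe A (rerouted)"           # edits made over days or weeks
post_back(master, base, working)
assert master[1] == "pipe A (rerouted)"
```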

Checkout has a number of disadvantages. The first is that the initial retrieval of data can take a long time - in general, checkout times tend to be measured in minutes rather than seconds. A second drawback is that the user has a restricted subset of the database to work in. If he discovers that he needs to work outside the area which he originally requested, then it is necessary to do another retrieval, which may again take a significant time. If relationships (including topological relationships) are allowed between objects in the database, this introduces further complications in deciding exactly how much data to check out - some mechanism is required for controlling updates to objects which have been extracted but which are related to objects that have not.

Version Management

Another approach to handling long transactions is to use a version managed database. In such a database it is possible to create different versions of the database called alternatives. Only one user can update an alternative at a time, but any number of users can read an alternative. Changes made by a user within an alternative are only seen within that alternative. The whole database can be seen within an alternative, but data is not replicated: only the changes relative to the parent version are stored in an alternative. To implement this efficiently requires version management to be built in at a fundamental level in the DBMS. When changes to an alternative have been completed, they can be posted up to the parent alternative. An optimistic approach to concurrency control is used, and any conflicts are detected and corrected at this time. It is also possible to have a tree structure of alternatives, so in addition to handling simple long transactions, this approach provides a mechanism for handling alternative designs in an elegant way. Version management overcomes all the problems with checkout mentioned above: there is no initial retrieval time, no copying of data is required, and the user has access to the whole database at all times. For a more detailed description of version management see Newell (4) and Easterfield (5).
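
The following Python sketch conveys the idea, though not the actual VMDS implementation: an alternative stores only its own changes, reads fall through to the parent version, and posting pushes completed changes up the tree (conflict detection, as described above, would run at that point).

```python
class Alternative:
    """One version of the database, storing only changes relative to its parent."""

    def __init__(self, parent=None):
        self.parent = parent
        self.changes = {}  # record_id -> value: a delta, not a full copy

    def read(self, rec_id):
        # The whole database is visible: walk up the version tree until
        # some ancestor holds the record.
        if rec_id in self.changes:
            return self.changes[rec_id]
        return self.parent.read(rec_id) if self.parent else None

    def write(self, rec_id, value):
        self.changes[rec_id] = value  # seen only within this alternative

    def post(self):
        # Push completed changes up to the parent alternative.
        self.parent.changes.update(self.changes)
        self.changes.clear()

master = Alternative()
master.write("valve-17", "closed")
design = Alternative(parent=master)   # a long transaction begins
design.write("valve-17", "open")      # visible only in this alternative
assert master.read("valve-17") == "closed"
design.post()                         # changes become visible in the parent
assert master.read("valve-17") == "open"
```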

A Client-server Implementation of Version Management

This section looks at the way in which version management is implemented in the Smallworld Version Managed Data Store (VMDS). It is necessary to consider the structure of the database at quite a low level in order to appreciate the difference between this architecture and that of traditional DBMS's in terms of the performance which is achievable. This section will just summarise the most important points about the implementation - for a more detailed discussion see Easterfield (5).

Tables in VMDS are implemented using a standard structure called a B-tree, which allows efficient navigation based on a key value to find a record with that key. Many DBMS's implement their tables using B-trees. However, the key difference between the VMDS approach and more traditional approaches, for the purposes of this discussion on performance, is that the datastore never updates a disk block directly while any version of the database still refers to it. Whenever some data is changed and committed to disk, a copy of the original disk block containing that data is created and the update is made to that copy. In turn, any other disk blocks in the tree which referred to the old disk block are also copied and updated so that they point to the new version of the data. These other disk blocks may still point to other disk blocks which have not been changed - so common data is shared between versions.
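
A highly simplified Python sketch of this copy-on-write principle (block contents and tree structure reduced to dictionaries; this is not the actual VMDS block format):

```python
blocks = {}        # block_id -> contents; an existing entry is never mutated
counter = [0]

def allocate(contents):
    blocks[counter[0]] = contents
    counter[0] += 1
    return counter[0] - 1

def update(block_id, key, value):
    # Never touch the original block: any version still pointing at it
    # continues to see the old data.
    new_contents = dict(blocks[block_id])   # copy the whole block...
    new_contents[key] = value               # ...and change only the copy
    return allocate(new_contents)

leaf = allocate({"pipe-9": "old route"})
root = allocate({"child": leaf})
new_leaf = update(leaf, "pipe-9", "new route")
new_root = update(root, "child", new_leaf)   # parent blocks are copied and
                                             # re-pointed, up to the tree root
assert blocks[root]["child"] == leaf         # the old version is still intact
```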

This is in contrast to traditional DBMS architectures, where an update will cause the data in the original block in which that data was stored to be modified. In such a short transaction based DBMS, a change to a record is immediately seen by all users. In contrast, when an update is made by a user of VMDS, it is only seen by that user until a version management operation such as a merge is carried out by another user.

A huge benefit of the fact that VMDS never updates a disk block in place is that disk blocks can be cached on client workstations. In any subsequent processing, if a required disk block has been cached on the local workstation, that block can be used immediately, in the knowledge that it has not been updated on the server. This is not the case with a standard DBMS: as soon as a block was updated by any user, a mechanism would be required for updating or uncaching all cached copies of that block (in a way which works consistently even in the event of hardware, software or network failures on part of the system). This is a very complex unsolved problem.
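
Because a committed block is effectively immutable, the client cache needs no invalidation protocol at all. A minimal sketch, assuming a server object with a read_block call (a hypothetical interface, in Python):

```python
class BlockServer:
    """Stands in for the database server: hands out immutable blocks by id."""
    def __init__(self, blocks):
        self.blocks = blocks

    def read_block(self, block_id):
        return self.blocks[block_id]

class Client:
    def __init__(self, server):
        self.server = server
        self.cache = {}                  # block_id -> block contents

    def get_block(self, block_id):
        if block_id not in self.cache:
            self.cache[block_id] = self.server.read_block(block_id)
        # The cached copy can be used without asking the server: no block
        # is ever updated in place, so a cached block cannot be stale.
        return self.cache[block_id]

client = Client(BlockServer({7: b"tree node"}))
client.get_block(7)     # fetched from the server and cached
client.get_block(7)     # served locally, with no network round trip
```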

Performance

This ability to cache blocks on the client workstation also means that most of the processing can be done by the client. Thus this approach removes the two main bottlenecks in GIS performance which arise with standard DBMS's - processing on the server and network traffic. The requirement for a typical GIS redraw transaction is to be able to retrieve hundreds or thousands of records, and potentially megabytes of data, from the database in a few seconds. In standard DBMS's, all the optimisation and query processing is done by the server, which quickly leads to a processing bottleneck on the server when processing large numbers of concurrent queries of this complexity. The second bottleneck is that all the data which is returned from the query needs to be transferred across the network, which leads to serious problems in terms of network traffic.

The version managed client-server approach offers enormous improvements in both of these areas, as very little processing is done by the server - it is essentially just a very simple, fast block server - and typically much (or all) of the data required will already be cached on the client, in which case network traffic is reduced and processing on the server is cut down still further. This makes it possible to have very large numbers of users concurrently updating a single continuous database. Tests carried out at one customer site showed no significant loss in performance as the number of users accessing a single server on a local area network was increased from 1 to 35. The users were all running an intensive data capture application. The only way in which anything like this level of performance could be achieved with the standard DBMS approach is by checking out data and storing it locally on each workstation, but this approach has other significant drawbacks, which were discussed earlier.

Distributed Database

A very common requirement for companies who use GIS is to be able to access the GIS data in multiple geographically separate locations, which may not be connected by very fast communications. However, for some applications it may be necessary to view the database as a whole, and people working in one district may occasionally want to look at data in other districts. The disk block caching described in the previous section has a natural extension to this sort of distributed database environment (Newell (6)).

It is possible to have a central master database, and then at each remote location which is accessing the database, have a slave database which stores cached blocks on disk rather than in memory. This persistent cache will store blocks from one session to another, and these cached blocks can be accessed by all users working at this remote site. If users at the remote site typically access a specific subset of the master database (which could be a geographic subset, such as one of ten districts of a utility, or a functional subset of the database, such as water network objects), then the local disk cache will automatically become populated with blocks containing data from that subset. Updates made at remote sites still go directly back to the master database.

This persistent cache approach to distributed database further extends the capabilities described in the previous section, of reducing the load on the central server and reducing network traffic. It provides a very elegant solution to the distributed database problem, in that data is automatically distributed to the appropriate location based on demand. If the usage of the database at a remote location changes - for example, if the areas managed by different district offices are changed - then the data cached at the location will automatically change over time as users look at the new areas.

Once again, this whole approach will only work in a version managed environment because of the problem of synchronising updates to multiple copies of a block in a short transaction environment.

The persistent cache approach can also be used in a multi-level way. A district could have its own large persistent cache, and outlying depots have their own persistent cache. A request for a block from a user in the depot would first check the cache on that client machine, then look in the persistent cache on the depot server, and then the persistent cache on the district machine, before finally looking in the master server.
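
A sketch of this multi-level lookup in Python (cache levels and the master server reduced to dictionaries; names hypothetical): on a miss at every level, the block is fetched from the master and every cache that missed is populated on the way back.

```python
def fetch_block(block_id, caches, master):
    """Search each cache level in order (client memory, depot disk,
    district disk); on a total miss, fall through to the master server."""
    for level, cache in enumerate(caches):
        if block_id in cache:
            block = cache[block_id]
            break
    else:
        level, block = len(caches), master[block_id]
    for cache in caches[:level]:      # fill the caches that missed
        cache[block_id] = block
    return block

client, depot, district = {}, {}, {}
master = {42: b"block contents"}
fetch_block(42, [client, depot, district], master)
assert 42 in client and 42 in depot and 42 in district
```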

Benefits of Version Management for System Development

Version management can also be extremely useful when doing development, or when making changes to a production database. Not only can changes be made to data within a version, but it is also possible to make changes to the data model within a version without affecting the data model (or data) in any other version. This is extremely useful for an application developer, who can test any changes to the data model against the whole database in their own version, before applying the same changes to the master version of the database. Similarly, any operations which make large-scale changes to existing data in the production database can be run in a version, and the results can be thoroughly tested before the changes are posted to the master version. If any problems are found then that version can simply be thrown away without having done any harm to the master version of the database.

Version management is also an important requirement for CASE (Computer Aided Software Engineering) applications, and indeed other aspects of GIS technology are useful in implementing a CASE tool, such as the ability to define objects with data which can be represented graphically, and the ability to maintain complex relationships between them. Smallworld have used this fact to implement a CASE tool which can be used to define and modify data models for GIS applications. This uses version management techniques extensively to help provide functions like the ability to create alternative versions of the data model, apply these to alternative versions of a populated database for testing and development, and finally apply the changes to the master version of the database with a mechanism for propagating the schema changes down to other versions in the database.

Integration

Integration between GIS and non-GIS data and applications is generally accepted as being very important in gaining maximum business benefits from the GIS. So although significant advantages can be obtained by using a new type of DBMS for GIS, it is also very important that any new DBMS should provide good integration with existing DBMS's and applications.

The Smallworld system caters for this by allowing tables to be defined either in the Smallworld version managed data store or in a standard DBMS such as Oracle or Ingres. To both users and application developers, access to either type of table is identical (an example of encapsulation in an object-oriented system). The only difference is that updates to tables in a short transaction database are immediately visible to all users in all versions of the database, whereas updates made to the Smallworld VMDS are visible only in the alternative in which the change is made. Any distributed functions provided by the external DBMS can also be used to access other DBMS's on remote machines. It is also possible to use SQL gateway products like SequeLink to access remote DBMS's.
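
The encapsulation can be pictured as follows - a Python sketch with hypothetical class names (the real system presents Magik objects): application code calls the same method whichever store holds the table.

```python
class VersionManagedTable:
    """Backed by VMDS: reads go through the current alternative, so
    changes are seen only within the version in which they were made."""
    def __init__(self, alternative, name):
        self.alternative, self.name = alternative, name

    def get(self, rec_id):
        return self.alternative.read((self.name, rec_id))

class ExternalTable:
    """Backed by a short transaction DBMS: an update is immediately
    visible to all users in all versions."""
    def __init__(self, connection, name):
        self.connection, self.name = connection, name

    def get(self, rec_id):
        cur = self.connection.execute(
            "SELECT * FROM %s WHERE id = ?" % self.name, (rec_id,))
        return cur.fetchone()

# Application code is identical for either kind of table:
#     record = table.get(42)
```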

External applications may also need to access data stored in the GIS. This can be done by providing an SQL server which allows external applications to query the data using standard SQL. There are some issues with this approach, in that standard SQL does not support the spatial operators or data types which are supported by the Smallworld database. However, some standard DBMS's now include the ability to add new operators and data types, and the draft SQL3 standard also addresses these areas, although there are certainly still shortcomings in SQL in this respect at the moment. Another option for providing external access to the GIS data is to provide an API which allows queries or commands written in the GIS development language (Smallworld Magik, in the case of the Smallworld system) to be executed by an external application, with the results passed back. This provides the potential for executing more complex GIS queries such as network traces, which would be very difficult to express in SQL.

One issue which still requires some careful thought is the exact nature of the integration between long and short transaction data and how commits in both environments interact. Although any table can be stored in either the short or the long transaction database depending on what type of updates are most frequently carried out against that table, there may be occasions where related updates are required across both environments. There may be some cases where controlled replication of certain data in both environments is a valid option.

Acknowledgements

The authors would like to acknowledge the work of the developers of the technology described in this paper. Mark Easterfield developed the Version Managed Data Store and most of the concepts explained in this paper. Nicola Terry developed the Data Dictionary and Gillian Kendrick developed the CASE tool, which were touched on briefly in this paper and will be revisited in more detail in a future paper. Betty Barber developed the capability to integrate external DBMS's with VMDS.

Conclusion

The conclusion of this paper is that, although there are some attractions to using standard DBMS's for GIS, an alternative client-server database architecture based on version management has very significant advantages over traditional DBMS architectures for GIS. These advantages are firstly in the area of handling long transactions and managing alternative versions of data, which are critical issues in handling large multi-user GIS databases, and secondly in the area of performance with large numbers of users in a networked environment.

The authors have personal experience of implementing both approaches, and believe that, although progress is being made on some of the shortcomings of standard DBMS's for GIS, it is hard to see how they can provide the advantages of the database architecture described in this paper without some major re-architecting, which seems unlikely to happen. Thus it is our belief that the best way forward is for us to continue to develop the database technology described in this paper, and further enhance its ability to integrate with commercial DBMS's using SQL standards. The alternative is to wait for the day when the DBMS vendors provide equivalent functionality to what we now have.

References

1. Batty P.M., An Introduction to GIS database issues, Proceedings of AGI 92, Birmingham, 1992.

2. Seaborn D., 1995: The Year GIS Disappeared, AM/FM XV Conference Proceedings, San Antonio, April 1992, pp 822-826.

3. Batty P.M., Exploiting Relational Database Technology in a GIS, Mapping Awareness, July/August 1990.

4. Newell R.G., Theriault D.T. and Easterfield M.E., Temporal GIS - modelling the evolution of spatial data in time, Computers and Geosciences, Vol 18, No 4, pp 427-433, 1992.

5. Easterfield M.E., Newell R.G. and Theriault D.G., Version Management in GIS: Applications and Techniques, Proceedings of EGIS '90, Amsterdam, April 1990.

6. Newell R.G., Distributed Database versus Fast Communications, AM/FM XVI Conference Proceedings, Orlando, March 1993, pp 647-659.
