Monday, August 29, 2011

Smallworld Technical Paper No. 4 - Version Management in GIS - Applications and Techniques

by Mark E. Easterfield, Richard G. Newell & David G. Theriault

Abstract

The majority of today's GIS systems adopt an approach to multi-user working which is based upon the standard model underpinning conventional commercial RDBMS systems. This model is appropriate in applications where only the most recent state of the database is of interest, but it does not work well over long transactions, and is inappropriate for manipulating multiple versions. Many GIS activities involve transactions lasting days or weeks, and others could benefit from the ability to develop, compare, and retain or merge different versions of data. This paper proposes an alternative approach to multi-user working, based around deferred detection of conflict. The paper reviews a number of benefits which the approach brings, such as multi-user data capture, developing and assessing alternative designs, and the ability to include external mapbase updates over time. The paper concludes by considering the features required of the underlying system in order to support the approach.

Introduction

The majority of today's GIS systems adopt an approach to multi-user working based on the one designed for conventional data processing requirements. This approach is based on the principal that it is only permissible to make changes to the most up to date version of the data, a rule which is satisfied by a combination of locking and enforced rollback.

This rule is entirely appropriate in situations where changes tend to be small, where the data represents a single state, and where conflict must be detected at the earliest opportunity. A good example is a ticket reservation system, where it is vital to know with absolute certainty whether a seat has been sold, or is still available.

The approach is much less appropriate to most GIS applications, where changes may be substantial, and be self consistent only over a considerable period of time, and where it may be legitimate to construct alternative versions of data before deciding upon one or another to represent the real world. However, most of the GIS products available currently are locked into the conventional model, as they tend to be heavily dependant upon standard commercial DBMS offerings, which are designed for the DP environment.

The particular deficiency of conventional DBMS systems concerned with their inability to handle long transactions has been recognised, particularly where graphical data is concerned. However, attribute data is still generally managed in the traditional manner, and there are problems in maintaining consistency between the two forms of data. Currently, there is little support for the management of alternative versions of data.

This paper examines the benefits of a data management system which supports alternative versions. It illustrates how such a system also provides a much more appropriate multi-user model in the GIS environment, and provides additional spin-off benefits in a distributed environment.

The paper goes on to consider the requirements of such a system, independent of its implementation, and concludes with one possible implementation strategy.

[ Figure 1 not available ]

The Benefits Of Enabling Alternative Versions

Traditional database technology has encouraged a view of the world, as represented in the database, in which there is a single, canonical state. It is possible to modify the database, from one state to the next, but it is not permissible to have multiple canonical states simultaneously.

At first sight this seems most reasonable there is only one true state of the world (in real life). Furthermore, it precludes incompatible changes from being made in a very simple way there need never be any doubt about the logical integrity of the data. The approach is good at representing a static past, or a present which evolves in small, simple, discrete steps.

However, many applications are actually used to model the future, or to model a loose form of the present in which changes in the real world are not represented as a continuous series of small transitions but rather as a few discreet modifications, each of which is substantial.

When modelling the future, the ability to represent alternatives is of great value. For example, when planning a new development, a number of different possibilities can be reviewed and costed before one is chosen. A system which permits alternative models of the (future) world to co-exist independently of one another must also inherently permit these alternative views to remain separate from the accepted view of the present, so that if there is a canonical view of the world it is not affected by "what if"s.

This requirement to keep a canonical data set apart from changes to it also occurs during customisation. In this case, the alternatives do not necessarily represent real futures at all, but simply experiments to check that the customisation work does what is intended. Traditionally, such verification is performed on a small subset of the data, which has been copied out of the master database. However, this approach tends to conceal problems of scale until the customisation is released, when it is too late to rectify them. Using alternatives, customisations can be tested on the real database, highlighting problems immediately, but without interfering with users in any way.

When modelling the present, there is benefit in being able to defer making individual changes visible until a complete set of changes is in a logically consistent state. In the case of adding a newly acquired data set, or of making substantial changes to the use of a piece of land, say, the data may remain inconsistent for some time. The type and duration of change is more akin to that in a CADCAM environment, rather than the small transitional changes encountered in DP. In an environment which provides alternatives, the canonical state represents one possibility, and the new state resulting from the modifications represents another. The new state can remain hidden from general view for as long as is required, and only released when the whole change is considered complete.

Looking further afield, there are other advantages in an environment which supports alternatives. If a solution has been found to the problem of merging independently developed versions (which is fundamental to the entire alternative strategy), then it is also possible to contemplate merging changes which have occurred entirely outside the particular database.

One example is that of the map base vendor, who issues changes to the map base from time to time to a set of customers. Each customer may have wished to modify his existing base over time, to keep it in step with the real world. At the time the new base is issued, some of the changes will have ceased to be relevant, because they have now been incorporated by the vendor, but others, for reasons of timing or of the nature of the data, will still be required. In such a situation, it is necessary to be able to identify the differences from the original base, and to re- apply those modifications which are still relevant to the new base.

This is illustrated in Figure 1, where the act of bringing the map base up to date would actually prove undesirable if allowed to proceed in an uncontrolled manner. The alternative management approach permits the user's foreground to be reconciled with the new map base in a controlled fashion, so that day to day users of the data never see a dangerous and misleading combination.

A second example where the comparison and merging of disjoint versions is of value is in a distributed database environment. To be truly distributed, parts of the database must be able to function whilst they are cut off from other parts. This necessarily demands that changes must be able to be made which cannot be wholly checked for logical consistency until all the database is visible again. An approach which accepts the mutual coexistence of different versions, with consistency validation deferred until late in the day, makes this feasible.

An approach which allows alternative (and therefore necessarily conflicting) versions of the state of the database to exist is mutually exclusive to an approach which guarantees a single, fully consistent state. In many DP applications, the latter approach is absolutely necessary. We contend, however, that in GIS applications, trans- actions are naturally long, concurrent alternative scenarios are desirable, and that in practice mutually conflicting changes are either rare or wholly unavoidable, and that therefore a model based upon alternatives is more appropriate.

The last of the three points is the most crucial to this argument. If there is a large amount of avoidable conflict which has to be resolved late in the day, then the amount of work required to create a consistent state outweighs the benefits which have been proposed. We believe, however, that day to day conflict will generally be rare in a reasonably well managed environment, and that the other problem of substantial unavoidable conflict is one which is outside the scope of conventional database management techniques.

The Relationship Between Versions And The Management Of Time

The management of versions and the managment of time are often referred to synonymously. However, the two are not the same, and should not be confused (Langran 1988).

Version management, with which this paper is principally concerned, offers control over database time, which runs strictly forwards. As will be seen later, the management of versions may be conveniently implemented by successive overlaying of changes.

Management of real world time is different, because it is possible to modify the past by adding new data (and therefore time in one sense is no longer strictly forwards). For example, the data in a map base may reflect the external world at the current time, say 1990; later on, data may be added which describes what the world was like at an earlier date, for example in 1953. Database time and real world time are thus running in opposite directions.

It is possible to implement mechanisms which manage time in a database, but the methods described here are not generally appropriate. As with alternative version management, an efficient solution demands functions built into the database manager itself which are outside the scope of this paper.

The Requirements Of A System Providing Alternative Versions

Any system which permits alternative states of data to exist concurrently must minimally provide a certain set of features. These are considered in this section, and comprise:

  • The ability for different users to work on seemingly wholly independent views of the database, so that other users remain unaware of the changes they are making
  • The ability to commit changes to disk without the changes becoming visible to other users
  • The ability to make changes potentially visible to other users when they have been authorised as a consistent set
  • The ability of a user to defer, until it is convenient, the making visible of changes effected by someone else
  • The ability to determine precisely what has changed between two versions
  • The ability to validate the consistency of a set of changes to the database.

The first point is an underlying pre-requisite for alternative version management. Conventional GIS solutions to the problem of data access which are based on check-out and check-in suffer from restricting access through excessive locking of data, perform poorly during the check-in/check-out operations themselves, and are expensive in their use of disk. A system providing effective alternative version management demands solutions more directly in tune with that specific approach.

The separation of the action of committing changes to disk, to that of making changes visible to others, is actually more related to long transactions than to alternative management per se. However, since an alternative itself may evolve through a number of states, and may exist for an indefinite period of time, there is the de facto requirement to be able to save individual states of an alternative while it is still hidden from general view.

The traditional DP approach to transaction management combines the act of committing to disk, and that of making changes visible to others. It also incorporates rollback using the same technique, for good measure. These three functions, although related, do not have to be combined, and can actually be considered separately. In an environment which supports a number of individually identifiable alternatives, any one can be committed to disk and rolled back without affecting the others. The act of making the changes visible to a wider audience then becomes a separate administrative function, rather akin to the formal approval and release of drawings or documents.

In an environment where conflicting alternatives may co-exist, the merging of two such alternatives requires a high degree of formality. Three distinct phases are required. The first, making a set of changes potentially visible, has already been described. However, before a user working on another alternative can be allowed to see those changes, he too must take positive action, to prevent the view of the database he is using suddenly going from a self consistent state to an inconsistent one. Just as the user has been buffered from changes going on elsewhere up until now, so he should continue to be protected until there is a good chance that the uptake of the new view will not cause conflict. This process of accepting a set of changes which have been issued from another alternative is called refreshing. The refresh process itself consists of two parts, the first validating that the new combined state is logically consistent, and the second actually making the combined state the one to be used.

When two alternatives are developed independently, the only data which can truly be considered common between the two is that which existed in a common ancestor. The difference between a pair of alternatives is a function of the modifications made to that ancestor in each case. And as these modifications are made in isolation of one another, there is no physical relationship between them, so they may only be compared by considering what logical changes have been made. In particular, it is not possible to simply physically merge two versions together, it is necessary to re-apply the logical changes that have been made.

Thus the ability to refresh demands the ability to determine what changes have been applied to the database during the life of an alternative version.

Before a refresh can be completed, the new view of the database which would be created by incorporating the changes of one alternative into the changes of another must be validated to determine whether any integrity constraints have been broken. In databases based on the relational model, two sorts of conflict can arise. There is a conflict if the same record (identified by primary key value) has been modified in a different way in the two alternatives; and there may be a conflict arising from an application constraint. The former is easy to detect, the latter more difficult, requiring application specific code. However, things are not as bad as they may seem, since if the application possesses the ability to verify the integrity of data generally (for example, as it is entered) then precisely the same verification functions can be used on the set of changed records during the refresh process.

The detection of either form of conflict can be built upon the ability to retrieve the set of records which have changed between the version being refreshed and the common ancestral version. The ancillary problems of how to report or display conflict, and of providing tools to rectify the situation, fall into the application and user interface domain, and are not considered here.

The foregoing has described a set of requirements for a platform which supports the management of alternatives. We next describe one possible mechanism by which these requirements can be fulfilled.

An Implementation Strategy

Many of the demands made of a system which manages alternative versions of data concern the ability to identify the differences between them. Thus we are considering different states of the data, and, most importantly, what has happened to get from one state to the next.

Using the relational model, the changes between one state and another can be described by the set of changed records, identified by primary key value, and the type of change (insertion, deletion or modification). Non relational models present more problems, because there is no simple way of comparing entities between two independently generated sets of data. This paper only considers the relational case.

The implementation described here is based on a tabular datastore constructed from conventional B*-Trees, with the fundamental versioning mechanism built into the underlying data structure itself. This has the properties of low space overhead and almost zero performance overhead during conventional access. This level of the implementation permits multiple versions of the data to co-exist, with blocks which are common to two or more versions being shared between them.

Note that the approach does not demand that all blocks reside in the same file. In particular, benefit can be gained in the long term by storing a version on a mass storage medium such as CD ROM and maintaining the changes on conventional disk.

In order to determine the differences between a version and one of its ancestors, it is only necessary to examine those blocks in that version which differ from the ancestor's version, rather than all the blocks in the database.

Different versions of the same tree are simply accessed via their different rootblocks, and the action of committing a new version to disk simply involves storing the new root block number in an appropriate place. The old version continues to remain accessible while its root block number is still known.

[ Figure 2 not available ]

Alternative versions form a hierarchy, as illustrated in Figure 3. At the top (the root) is the master database, which would probably correspond to what is considered the canonical state. Nodes which are derived from this could be used, for example, to differentiate between themes (e.g. pipework, electrical, cadastral, etc.), between different spatial areas, or between any other aspects where the probability of conflict is low. At the leaves of the tree, each "user" has an alternative in which to work without conflicting with others. In many cases, the levels of the tree will have close correspondence with levels of approval.

[ Figure 3 not available ]

Changes are posted (i.e. made more widely visible) one level at a time. Provided that the up to date version at the higher level is the one from which the changed version was derived, only application validation in the form of "approval" is required the version itself will be self consistent.

If the version at the higher level has since been updated, the version about to post must first refresh, so that it is the up to date state which is being posted into. If this formality were not undertaken, the intervening changes to that higher level version would be lost.

The posting process can generally be implemented using purely physical means; the refresh process cannot. The refresh process is effected by re-applying into the new state the changes which had been made since the last refresh, by determining which records had changed, and the form of change, and repeating the process.

Of all the processes, refreshing is the most complex and potentially time consuming. However, the time taken is proportional to the degree of change, not the overall size of the database.

The cycle of modification, posting and refresh is illustrated in Figure 4.

[ Figure 4 not available ]

Accessing Conventional RDBMs

A GIS may be called upon to access and maintain data which is handled using a conventional DBMS system. We know of no system around today that supports the sort of functionality described in this paper, so if the functionality is to be provided to the end-user it must be achieved by means of a version managed front end to the non version managed underlying system. This solution is bound not to be wholly satisfactory, and includes such problems as:

  • Finding an optimal way of populating the version managed datastore tables which performs well and doesn't require too much disk
  • Coping with situations where the conventional database is updated by the external access (i.e. outside the version managed environment)
  • Retaining version information once data is posted back to the conventional database

The first of these problems is almost certainly soluble by techniques such as remembering frequently used queries and dynamic downloading of data. The latter two problems are much more difficult to resolve. However, we believe that the benefits of a system which supports multiple versions and alternatives in GIS applications outweighs these potential difficulties.

Conclusions

In this paper we have suggested that the conventional DP transaction approach to multi-user working is inappropriate for GIS applications. There are a number of essential differences between the conventional approach and the one described here. These include the ability to maintain, merge and administer a hierarchy of alternative versions in a controlled manner and to safely allow multi-user access without locking.

We contend that a system which uses deferred detection of conflict to permit alternative versions of data to be worked on concurrently provides many benefits. These include the ability to experiment and cost alternative scenarios, to develop and test customisations on a realistic size of dataset, and to resolve many of the current problems relating to long transactions and distributed working.

We have identified a number of requisites of a system which is to successfully support alternative working, which include the ability to separate the process of making one user's changes visible to another into discreet stages, and the ability to determine what has changed between one version and another.

Finally, we have demonstrated the feasibility of the approach by proposing one possible implementation technique which fulfils these requirements efficiently.

References

Armstrong Marc P. (1988). Temporality in Spatial Databases, Proceedings GIS/LIS'88, San Antonio 1988, pp 880-889.

Barrett, Myles 1989. Object-Oriented Language Extensions for CAD/CAM: Journal of Object-Oriented Programming, Vol. 2, No. 2, July/Aug 1989.

Chance, A., Newell, R. G., and Theriault, D. G. 1990. An Object Oriented GIS Issues and Solutions. Conference Proceedings of EGIS, Amsterdam, April 1990.

Chou H., and Kim, W., 1986. A Unifying Framework for Version Control in a CAD Environment. Proceedings of VLDB86.

Ecklund, D. J., Ecklund, E. F., Eifrig, R. O., and Tonge, F. M. 1987. DVSS: A Distributed Version Storage Server for CAD Applications. Proceedings of VLDB87. Katz, R. H. and Chang, E. 1987. Managing Change in a Computer-Aided Design Database. Proceedings of VLDB87.

Kent, William 1989. An Overview of the Versioning Problem. ACM SIGMOD 1989 Conference in Portland, Oregon. SIGMOD record Vol. 18 No 2, June 1989.

Langran Gail and Chrisman Nicholas R. 1988. A Framework for Temporal Geographic Information. CARTOGRAPHICA Vol. 25, No. 3, 1988 pp1-14.

No comments: