Tuesday, September 23, 2014

The importance of Data Glossaries

In the Structured Integration approach, Data is a foundational element, underlying the other three.

Services (which in my terminology include APIs) exchange data in the form of JSON or XML messages, whose fields/elements must be defined and documented.

Events, no matter whether we deal with fine-grained Domain Events within an application or with coarse-grained events (such as an SAP IDoc published by a middleware adapter) also contain collections of fields.

In this post I am not going to discuss the governance of complex message or event structures (defined via XSD or JSON Schema). Rather, I intend to talk about something more basic: the simple data field (generic string, numeric string, or "true"/"false" boolean string).


Documenting simple fields

Every time a simple data value must be handled, three aspects come into play: encoding, syntax, and semantics.

While the binary representation and syntax aspects are well covered by tools, the same cannot be said about the semantic aspect.  Restricting the discussion to integration contexts using self-describing formats, the meaning of a data field is communicated mainly via one of the following:
  • <xs:annotation> elements in the XSD, with an embedded <xs:documentation> element.  An XSD annotation is typically added as a child element of an <xs:simpleType> declaration, which defines the syntax of the element or attribute
  • "description" fields in JSON Schema definitions
  • description: of API query parameters, which is very similar in most API definition notations (RAML, Swagger, etc.)
These documentation elements allow unstructured content. They may be omitted, or they may be filled with imprecise and inconsistent descriptions, leading to miscommunication between business stakeholders, analysts, and developers.

Clearly communicating the business meaning of the data fields that take part in any IT-enabled process is key to bridging one of the many instances of the infamous "gap between Business and IT".

In the area of systems integration, the above is true whether the value is part of a JSON API response, part of a SOAP message body, part of a JMS message payload that is being sent out in an event-driven fashion, and so on.

It would be very beneficial if the aforementioned documentation elements were made mandatory and referred to a web resource that unambiguously defines the meaning of the data, acting as a Single Source of Truth.

 For example, consider the following JSON Schema excerpt:

"replacement_product_id": {
  "description": "http://mycompany.com/glossary/dataelements/replacement_product_id",
  "type": "string",
  "pattern": "^(\\d)+$"
}
  
The resource URL in the "description" field of the example could point to a Glossary API (see the resource path /dataelements/{element}) rather than to a static page.  That would open up several possibilities beyond shared documentation, such as schema validation and schema completion, through tools that leverage this API.
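As a rough illustration of the kind of design-time check such tooling could perform, the following Python sketch walks a JSON Schema and reports every simple field whose "description" does not point to the glossary. The function name find_undocumented_fields and the "quantity" field are hypothetical, and the glossary URL prefix is simply the one from the excerpt above. No call to a real Glossary API is made.

```python
# Assumption: glossary entries live under the URL prefix used in the excerpt above.
GLOSSARY_PREFIX = "http://mycompany.com/glossary/dataelements/"
SIMPLE_TYPES = {"string", "number", "integer", "boolean"}


def find_undocumented_fields(schema, path=""):
    """Return the paths of simple fields whose description is not a glossary URL."""
    problems = []
    for name, prop in schema.get("properties", {}).items():
        field_path = f"{path}/{name}"
        if prop.get("type") == "object":
            # Recurse into nested objects.
            problems.extend(find_undocumented_fields(prop, field_path))
        elif prop.get("type") in SIMPLE_TYPES:
            description = prop.get("description", "")
            if not description.startswith(GLOSSARY_PREFIX):
                problems.append(field_path)
    return problems


schema = {
    "type": "object",
    "properties": {
        "replacement_product_id": {
            "description": GLOSSARY_PREFIX + "replacement_product_id",
            "type": "string",
            "pattern": "^(\\d)+$",
        },
        # Hypothetical field with a free-text description instead of a glossary link.
        "quantity": {"type": "integer", "description": "how many"},
    },
}

print(find_undocumented_fields(schema))
```

A check like this only verifies that a glossary reference is present. Confirming that the referenced element actually exists would require a lookup against the glossary itself.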

The following sections go into some more detail about two distinct key features that a Glossary Application could expose via its API:  Data Domains and Semantic Data Elements.

Data Domains

When considering a simple data field definition (which is the subject of this blog post), beyond its immediate syntactical attributes (type, length, pattern, and nullability),  a very important aspect is represented by the set of values that the described field is allowed to assume.

This set of allowed values is sometimes constrained through a value enumeration (e.g., for ISO unit of measure codes), but it is often an open-ended value set, such as the set of customer IDs for all the customers of our company.   Regardless of this, it is important to see the concept of this value set in conjunction with the immediate syntactical attributes of the field, a combination that can be called a Data Domain.
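The combination of syntactical attributes and value set can be sketched in Python as follows. This is only a minimal model under stated assumptions: the class DataDomain, its field names, and the sample values (plain business terms rather than real ISO codes) are all hypothetical.

```python
import re
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class DataDomain:
    """Syntactical attributes plus the set of values a field may assume."""
    name: str
    pattern: str                 # syntactic constraint, as a regular expression
    max_length: int
    nullable: bool = False
    # None models an open-ended value set (e.g., customer IDs).
    allowed_values: Optional[FrozenSet[str]] = None

    def accepts(self, value: Optional[str]) -> bool:
        if value is None:
            return self.nullable
        if len(value) > self.max_length:
            return False
        if re.fullmatch(self.pattern, value) is None:
            return False
        return self.allowed_values is None or value in self.allowed_values


# Enumerated domain: illustrative values only, not an authoritative code list.
unit_of_measure = DataDomain(
    "unit_of_measure", r"[A-Z_]+", 16,
    allowed_values=frozenset({"PIECE", "BOX", "KILOGRAM", "LITER"}),
)
# Open-ended domain: only the syntax is constrained.
customer_id = DataDomain("customer_id", r"\d+", 10)

print(unit_of_measure.accepts("BOX"), unit_of_measure.accepts("GALLON"))
print(customer_id.accepts("0000123456"), customer_id.accepts("12A"))
```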

Looking at individual applications, it is obvious that we end up with many application specific data domains, which can be called Technical Data Domains as they link to technical metadata of specific applications (taking their specific configuration/customizing into account).

Example:
In the world of SAP, this concept appears in the form of SAP Data Dictionary (DDIC) Domains, which may specify a value set either via a code list or by referencing a "check table" (i.e., an SAP table that defines the value set as the set of its primary key values).  The SAP data dictionary domain MTART, for instance, defines the set of possible material/product types, which are defined in table T134 (the check table for the domain).

Still, this concept can be made more general by abstracting away from any specific application and modeling in each enterprise a set of Business Data Domains, which are application agnostic: in other words, all the relevant values are defined purely in business terms.  For example:

Business Domain   Business Value          Technical (SAP) Domain   Technical (SAP) Value
product_type      FINISHED_PRODUCT  ==>   MTART                    FERT
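A mapping like the one in the table above could be held as a simple lookup structure. The following Python sketch assumes a two-way translation between business and technical values. The names BUSINESS_TO_SAP, to_technical, and to_business are hypothetical, and only the single entry from the table is included.

```python
# (business domain, business value) -> (technical domain, technical value)
# Only the entry from the table above is included here.
BUSINESS_TO_SAP = {
    ("product_type", "FINISHED_PRODUCT"): ("MTART", "FERT"),
}
# Reverse lookup, derived so that the two directions cannot drift apart.
SAP_TO_BUSINESS = {tech: biz for biz, tech in BUSINESS_TO_SAP.items()}


def to_technical(domain, value):
    """Translate a business value into its technical (SAP) counterpart."""
    try:
        return BUSINESS_TO_SAP[(domain, value)]
    except KeyError:
        raise ValueError(f"no technical mapping for {domain}={value}") from None


def to_business(domain, value):
    """Translate a technical (SAP) value into its business counterpart."""
    try:
        return SAP_TO_BUSINESS[(domain, value)]
    except KeyError:
        raise ValueError(f"no business mapping for {domain}={value}") from None


print(to_technical("product_type", "FINISHED_PRODUCT"))
print(to_business("MTART", "FERT"))
```

Raising an error for unmapped values, rather than passing them through, is a design choice of this sketch. It makes gaps in the glossary visible instead of letting application-specific codes leak into business-level messages.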

A Business Data Domain can be part of a Common (a.k.a. Canonical) Data Model.  However, defining application-agnostic data values across the board is a massive undertaking.


Semantic Data Elements

Data domains (as defined above) are not specific enough to convey the semantics of a field.  

A Unit of Measure (UoM) field, for example, can be used in a multitude of business processes, services, APIs, and event messages, but mostly with restrictions on its general semantics.   For example, we can have an Order UoM (used in a line of an order) or a Pricing UoM (the UoM to which a price refers).

A Semantic Data Element thus represents a specialization of a Data Domain, based on the role that a field based on the element plays in business processes.
Although the underlying data domain is the same, multiple data elements may be subject to different business rules.   For example, one company may decide that since it sells only finished products, only "discrete" units of measure like Pieces and Boxes may apply to order_unit fields, and UoM values like Kilograms or Liters do not apply.
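The order_unit example can be sketched in Python as a data element that shares a value set with its domain but adds its own business rule. The class SemanticDataElement, the element pricing_unit, and the UoM values are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet

# Shared Data Domain: illustrative UoM values, not an authoritative code list.
UOM_DOMAIN: FrozenSet[str] = frozenset({"PIECE", "BOX", "KILOGRAM", "LITER"})
DISCRETE_UNITS: FrozenSet[str] = frozenset({"PIECE", "BOX"})


@dataclass(frozen=True)
class SemanticDataElement:
    """A Data Domain specialized by the role a field plays in business processes."""
    name: str
    domain_values: FrozenSet[str]
    rule: Callable[[str], bool]   # business rule applied on top of the domain

    def accepts(self, value: str) -> bool:
        return value in self.domain_values and self.rule(value)


# Orders may only use discrete units; prices may refer to any unit in the domain.
order_unit = SemanticDataElement("order_unit", UOM_DOMAIN, lambda v: v in DISCRETE_UNITS)
pricing_unit = SemanticDataElement("pricing_unit", UOM_DOMAIN, lambda v: True)

print(order_unit.accepts("BOX"), order_unit.accepts("KILOGRAM"))
print(pricing_unit.accepts("KILOGRAM"), pricing_unit.accepts("GALLON"))
```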

Any business-relevant simple value used in BPM process instances, API documents, service operation messages, or event documents should be associated with a Semantic Data Element that defines its business meaning and associated business rules.

If the Data Elements are strictly based on Business Data Domains (application agnostic), then we have true application independence for these elements, since the applicable values are also defined by business analysts independently of applications.   This is a characteristic of a "pure" Canonical Data Model.

However, in almost all cases, complete application independence is not realistic, given the massive effort it requires and the agility demanded by integration initiatives.


Glossary Applications

A Glossary Application can be defined as one that allows the key terms that should be shared between business and IT to be defined precisely, at different levels of granularity.

As such, the scope of a glossary can range from broad business process areas to fine-grained data element definitions.   At a basic level, even a standard CMS like MS SharePoint can serve the purpose, although such a solution is normally not flexible enough without substantial customization.

On the other hand, commercial business glossary tools are normally sold as part of more extensive data governance and MDM suites, which are acquired to support a wide range of IT initiatives.
A list of criteria for the selection of such tools is given here.

As for using such a tool (also) to support data modeling for integration, the most important requirements should be:
  • Support for hierarchical definitions (e.g., functional areas, data domains, data elements)
  • Support for technical metadata and business rules
  • Functionality exposed via a REST API (for consumption by tools)
  • Collaborative features (to allow multiple stakeholders to define glossary items jointly)
  • Metadata export
  • Ability to run in the Cloud

 

Conclusion

The roll-out of a collaborative Data Glossary application with functionality exposed via an API and integrated with schema authoring tools can be a significant step to align business people, IT analysts, and developers, even if just considered in the context of design-time governance processes.

More opportunities lie ahead if we consider in addition the potential for run-time governance (as implemented at Policy Enforcement Points such as API gateways).  Selective domain value translation and selective data validation against business rules are just two examples.   However, these will be the subject of a future post.



