Chemical Markup Language - Molecular Convention

28 August 2011

This version:
http://www.xml-cml.org/convention/molecular-20110828
Latest version:
http://www.xml-cml.org/convention/molecular
Authors:
See acknowledgments.
Editors:
Sam Adams, University of Cambridge
Joe Townsend, University of Cambridge

Abstract

This specification defines the requirements of the Chemical Markup Language Molecular convention.

This document describes the concepts which are introduced in the molecular convention, explains how to compose a document that conforms to the molecular convention and illustrates these with examples.


Table of Contents

1. Introduction
    1.1 Notational Conventions
    1.2 Namespaces
2. Applying the molecular convention
3. Molecule Element
    3.1 Id
    3.2 Count
    3.3 AtomArray
    3.4 BondArray
    3.5 Formula
    3.6 Property
    3.7 Label
    3.8 Name
    3.9 Formal Charge
    3.10 Spin Multiplicity
    3.11 Chirality
    3.12 Spectrum
4. AtomArray Element
5. BondArray Element
5. Formula Element
    5.1 Count
    5.2 AtomArray
    5.3 Concise
    5.4 Inline
5. Property Element
    5.1 Scalar
    5.2 DictRef
    5.3 Title
6 Scalar Element
    6.1 Unit
    6.2 DataType
7 Label Element
    7.1 DictRef
8 Name Element
    8.1 DictRef
9 Atom Element
    9.1 ElementType
    9.2 Id
    9.3 X2
    9.4 Y2
    9.5 X3
    9.6 Y3
    9.7 Z3
    9.8 FormalCharge
    9.9 IsotopeNumber
    9.10 Spin Multiplicity
    9.11 Count
    9.12 AtomParity
    9.13 Label
    9.14 Property
10 AtomParity Element
    10.1 AtomRefs4
11 Bond Element
    11.1 AtomRefs2
    11.2 Order
    11.3 Id
    11.4 BondStereo
    11.5 Label
12 BondStereo Element
    12.1 AtomRefs2
    12.2 AtomRefs4
    12.3 DictRef

1. Introduction

The molecular convention is used to specify chemistry relating to molecules; for example connection tables, formulae, names and properties. Molecules can also contain spectra, though these have their own conventions depending on the type of spectrum (infrared, NMR, etc.).

Where the behaviour of an element or attribute is completely explained by the schema it is not further elaborated on in this document; typically in these cases an entry will only state whether the node is required, suggested or optional.

Except where they are expressly forbidden, the convention allows users to optionally include both other cml elements and attributes, and foreign namespaced elements and attributes. It is expected that in general tools will silently ignore the extra information because they will not be able to understand it.

1.1 Notational Conventions

The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [RFC2119].

The terms "element", "attribute", "child" and "parent" in this document are to be interpreted as described in the W3C Recommendation for Extensible Markup Language (XML) [XML].

1.2 Namespaces

This specification uses the following namespaces and prefixes to indicate those namespaces:

Prefix      Namespace URI                        Description
cml         http://www.xml-cml.org/schema        Chemical Markup Language elements
convention  http://www.xml-cml.org/convention/   Standard Chemical Markup Language convention namespace

2. Applying the molecular convention

The molecular convention MUST be specified using the convention attribute on a molecule element or a cml element. The value of the attribute MUST be a QName that represents the molecular convention, i.e. convention:molecular. If the molecular convention is specified on a cml element then that element MUST have at least one child molecule element that either has no convention specified or specifies the molecular convention.

Valid: the convention is specified on a molecule element.
<molecule xmlns="http://www.xml-cml.org/schema"
xmlns:convention="http://www.xml-cml.org/convention/"
convention="convention:molecular" id="m1">

    <!-- body is omitted. -->

</molecule>

Valid: the convention is specified on a cml element that has a child molecule.
<cml xmlns="http://www.xml-cml.org/schema"
xmlns:convention="http://www.xml-cml.org/convention/"
convention="convention:molecular">
    <molecule id="m1">

        <!-- body is omitted. -->

    </molecule>
</cml>

Invalid: the cml element has no child molecule element.
<cml xmlns="http://www.xml-cml.org/schema"
xmlns:convention="http://www.xml-cml.org/convention/"
convention="convention:molecular">

</cml>

Invalid: the convention is specified on an element that is neither molecule nor cml.
<formula xmlns="http://www.xml-cml.org/schema"
xmlns:convention="http://www.xml-cml.org/convention/"
convention="convention:molecular">

    <!-- body is omitted. -->

</formula>
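
The rule on where the convention attribute may appear, and how its QName value resolves, can be sketched in Python with the standard library. This is only an illustration: the function name root_declares_molecular is hypothetical, the check looks at the root element only, and it does not verify that a cml element actually has a child molecule.

```python
import io
import xml.etree.ElementTree as ET

CML_NS = "http://www.xml-cml.org/schema"
CONVENTION_NS = "http://www.xml-cml.org/convention/"


def root_declares_molecular(xml_text: str) -> bool:
    """True if the root is a CML molecule or cml element whose convention
    attribute resolves to the molecular convention (hypothetical helper)."""
    prefixes = {}
    source = io.BytesIO(xml_text.encode("utf-8"))
    for event, item in ET.iterparse(source, events=("start-ns", "start")):
        if event == "start-ns":
            # Namespace declarations are reported before the element starts.
            prefix, uri = item
            prefixes[prefix] = uri
            continue
        # The first "start" event is the root element.
        if item.tag not in (f"{{{CML_NS}}}molecule", f"{{{CML_NS}}}cml"):
            return False
        prefix, _, local = item.get("convention", "").partition(":")
        return local == "molecular" and prefixes.get(prefix) == CONVENTION_NS
    return False
```

Resolving the prefix through the declared namespaces, rather than comparing the literal string convention:molecular, keeps the check correct when a document binds the convention namespace to a different prefix.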

3. Molecule Element

Within the molecular convention, a molecule is REQUIRED to be a child of either cml or molecule elements.

A molecule in the molecular convention is used to hold any combination of:

  1. a string representation of a molecule (be it a name held in a name element, a label held in a label element or an inline representation in an inline attribute on a formula element).
  2. the chemical composition of a substance, for example using the concise attribute on a formula element, or by specifically listing the atoms in either an atomArray as a child of a molecule or in an atomArray as a child of a formula.
  3. a connection table of atoms, connected by bonds. By definition a molecule is a connected set, therefore hydrochloric acid (bonded) would be a single molecule, whilst H+ Cl- could be represented as a parent molecule containing two child molecules (one containing H+, the other containing Cl-).

3.1 Id

A molecule element MUST have an id attribute, the value of which MUST be unique amongst the molecules within the scope of the document.

The value of the id attribute MUST start with a letter, and MUST only contain letters, numbers, dot, hyphen or underscore.

IdStartChar ::= [A-Z] | [a-z]
IdChar ::= IdStartChar | [0-9] | "." | "-" | "_"
Id ::= IdStartChar (IdChar)*
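
The grammar above maps directly onto a regular expression. The following is a minimal Python sketch; the names ID_PATTERN and is_valid_id are hypothetical and not part of the convention.

```python
import re

# Id ::= IdStartChar (IdChar)*
# IdStartChar is an ASCII letter; IdChar adds digits, ".", "-" and "_".
ID_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9._-]*")


def is_valid_id(value: str) -> bool:
    """Return True if the whole string matches the Id production."""
    return ID_PATTERN.fullmatch(value) is not None
```

Using fullmatch rather than match ensures that trailing characters outside the grammar, such as a space, cause the id to be rejected.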

3.2 Count

A molecule that is a child of another molecule MUST have a count attribute specified. The value of this attribute MUST be a non-negative number.

Valid: each child molecule specifies a count.
<cml:molecule convention="convention:molecular" id="parentMol">

    <cml:molecule id="childMol1" count="1">

        <!-- body is omitted. -->

    </cml:molecule>

    <cml:molecule id="childMol2" count="0.5">

        <!-- body is omitted. -->

    </cml:molecule>

</cml:molecule>

Invalid: the child molecule has no count.
<cml:molecule convention="convention:molecular" id="parentMol">

    <cml:molecule id="childMol1">

        <!-- body is omitted. -->

    </cml:molecule>

</cml:molecule>

A molecule that is not a child of another molecule MUST NOT have a count specified.

Invalid: a molecule that is not a child of another molecule specifies a count.
<cml:molecule convention="convention:molecular" id="parentMol" count="2">

    <!-- body is omitted. -->

</cml:molecule>
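
The two count rules in this section can be checked mechanically. The sketch below is one possible approach using Python's standard library; the function name count_errors is hypothetical, and treating any value accepted by float() as a number is an assumption rather than something the convention states.

```python
import xml.etree.ElementTree as ET

CML_NS = "http://www.xml-cml.org/schema"
MOLECULE = f"{{{CML_NS}}}molecule"


def count_errors(root: ET.Element) -> list:
    """Return the ids of molecule elements that break the count rules."""
    errors = []

    def visit(element, parent_is_molecule):
        if element.tag == MOLECULE:
            count = element.get("count")
            if parent_is_molecule:
                # A child molecule needs a non-negative numeric count.
                try:
                    ok = count is not None and float(count) >= 0
                except ValueError:
                    ok = False
            else:
                # A molecule that is not a child molecule must not have one.
                ok = count is None
            if not ok:
                errors.append(element.get("id"))
        for child in element:
            visit(child, element.tag == MOLECULE)

    visit(root, False)
    return errors
```

An empty list means that every molecule in the tree satisfies both rules.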

3.3 AtomArray

A molecule MAY contain a single atomArray child except when it contains child molecules.

Invalid: the molecule contains both a child molecule and an atomArray.
<cml:molecule convention="convention:molecular" id="parentMol">

    <cml:molecule id="childMol1" count="1">

        <!-- body is omitted. -->

    </cml:molecule>

    <cml:atomArray>

        <!-- body is omitted. -->

    </cml:atomArray>

</cml:molecule>

3.4 BondArray

A molecule MAY contain a single bondArray child provided that it does not contain child molecules.

3.5 Formula

A molecule MAY contain any number of formula children.

3.6 Property

A molecule MAY contain any number of property children.

3.7 Label

A molecule MAY contain any number of label children.

3.8 Name

A molecule MAY contain any number of name children.

3.9 Formal Charge

A molecule SHOULD have a formalCharge attribute specified. For molecules that have child molecules, the value of the formalCharge SHOULD be equal to the sum, over the child molecules, of each child's formalCharge multiplied by its count.

The value of the formalCharge attribute on a molecule that does not contain child molecules is less well defined. In general its value is more important than that of individual atoms (e.g. a cyclopentadienyl anion would have formalCharge="-1" on the molecule but not necessarily have a formalCharge attribute on any of the atoms).
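
As an illustration of the rule for molecules with child molecules, the following Python sketch computes the expected parent charge. The function name expected_formal_charge is hypothetical, and treating a missing formalCharge as 0 and a missing count as 1 are assumptions made for the sketch, not rules of the convention.

```python
import xml.etree.ElementTree as ET

CML_NS = "http://www.xml-cml.org/schema"
MOLECULE = f"{{{CML_NS}}}molecule"


def expected_formal_charge(parent: ET.Element) -> float:
    """Sum formalCharge * count over the direct child molecules."""
    total = 0.0
    for child in parent.findall(MOLECULE):
        # Assumption: an absent formalCharge contributes zero charge.
        charge = float(child.get("formalCharge", "0"))
        # Assumption: an absent count is treated as 1.
        count = float(child.get("count", "1"))
        total += charge * count
    return total
```

For a parent holding one child with formalCharge="2" and count="1" and another with formalCharge="-1" and count="2", the function returns 0.0, which is the value the parent's own formalCharge attribute would be expected to carry.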

3.10 Spin Multiplicity

A molecule SHOULD have a spinMultiplicity attribute specified.

3.11 Chirality

A molecule MAY have a chirality attribute specified.

3.12 Spectrum

A molecule MAY have child spectrum elements. Each spectrum element MUST specify, using the convention attribute, the convention to which it conforms.

4. AtomArray Element

An atomArray element MUST be a child of either a molecule or a formula element. The atomArray is simply a container for atoms.

An atomArray element MUST contain at least one child atom element.

5. BondArray Element

A bondArray element MUST be a child of a molecule element. The bondArray is simply a container for bonds.

A bondArray element MUST contain at least one child bond element.

5. Formula Element

A formula element MUST be the child of either molecule or formula elements.

A formula MUST have at least one of the following: an atomArray child, a concise attribute or an inline attribute.

5.1 Count

A formula that is a child of another formula MUST have a count attribute specified. The value of this attribute MUST be a non-negative number.

Valid: a formula that is a child of another formula specifies a count.
<cml:molecule convention="convention:molecular" id="ml">

    <cml:formula>

        <cml:formula count="1">

            <!-- body is omitted. -->

        </cml:formula>

    </cml:formula>

</cml:molecule>

A formula that is not a child of a formula element MUST NOT have count specified.

Invalid: a formula that is not a child of a formula specifies a count.
<cml:molecule convention="convention:molecular" id="ml">

    <cml:formula count="1">

        <!-- body is omitted. -->

    </cml:formula>

</cml:molecule>

5.2 AtomArray

A formula element MAY contain a single atomArray element.

5.3 Concise

A formula element SHOULD have a concise attribute if possible, i.e. if it can be calculated from the atoms in the formula's atomArray or potentially from the parent molecule's atoms.

The concise attribute is used to hold an unstructured formula, i.e. one with no submolecules. The schema defines the allowed pattern for the concise attribute.

5.4 Inline

A formula element MAY have an inline attribute. The inline attribute can be used to hold any information. There is no fixed way for markup to be specified, but it is recommended that LaTeX style is used, e.g. H_{3}O^{+} to represent the hydroxonium ion.

5. Property Element

A property element is used to wrap a scalar and define to what the scalar value relates.

5.1 Scalar

A property MUST have a single scalar child that gives the value of the property.

5.2 DictRef

A property MUST have a dictRef attribute, the value of which is a QName referencing an entry in a dictionary which defines how this property should be interpreted.

5.3 Title

It is RECOMMENDED that property elements have a title attribute intended for human-readability.

The title attribute MUST NOT be empty and MUST contain at least one non-whitespace character.

The value of the title attribute MAY contain any valid Unicode character; however, it is RECOMMENDED that any character from outside the ASCII subset (codepoints 32-127) is represented using an entity reference.

6 Scalar Element

6.1 Unit

A scalar MUST have a units attribute, the value of which is a QName referencing the units of the value defined using the scalar.

6.2 DataType

A scalar element MUST have a dataType attribute, the value of which is a QName referencing the data type of the value defined.

7 Label Element

The semantics of a label are not defined in the schema; label values are normally commonly used standard or semi-standard text strings.

7.1 DictRef

A label MUST have a dictRef attribute, the value of which is a QName referencing an entry in a dictionary which defines how this label should be interpreted.

8 Name Element

A name element contains a string that is the chemical name of the molecule. The name does not need to be a structural chemical name. It is RECOMMENDED that formatting and foreign (non-ASCII) characters are encoded using LaTeX style markup.

8.1 DictRef

A name MUST have a dictRef attribute, the value of which is a QName referencing an entry in a dictionary which defines how this name should be interpreted.

9 Atom Element

An atom MUST be a child of atomArray.

9.1 ElementType

An atom MUST have an elementType attribute.

9.2 Id

An atom MUST have an id attribute unless it is part of an atomArray in a formula, in which case the id is OPTIONAL.

Valid: an atom in a molecule's atomArray with an id.
<cml:molecule convention="convention:molecular" id="ml" formalCharge="1">

    <cml:atomArray>

        <cml:atom elementType="H" id="a1"  formalCharge="1"/>

    </cml:atomArray>

</cml:molecule>

Invalid: an atom in a molecule's atomArray without an id.
<cml:molecule convention="convention:molecular" id="ml"  formalCharge="1">

    <cml:atomArray>

        <cml:atom elementType="H"  formalCharge="1"/>

    </cml:atomArray>

</cml:molecule>

Valid: an atom in a formula's atomArray without an id.
<cml:molecule convention="convention:molecular" id="ml"  formalCharge="1">

    <cml:formula concise="H 1 1">

        <cml:atomArray>

            <cml:atom elementType="H" formalCharge="1" />

        </cml:atomArray>

    </cml:formula>

</cml:molecule>

The value of the id MUST be unique amongst the atoms within the eldest containing molecule.

The value of the id attribute MUST start with a letter, and MUST only contain letters, numbers, dot, hyphen or underscore.

IdStartChar ::= [A-Z] | [a-z]
IdChar ::= IdStartChar | [0-9] | "." | "-" | "_"
Id ::= IdStartChar (IdChar)*

9.3 X2

An atom MAY have an x2 attribute, the value of which is used for displaying the object in 2 dimensions. This is unrelated to the 3-D coordinates for the object.

If an x2 attribute is present there MUST also be a y2 attribute.

9.4 Y2

An atom MAY have a y2 attribute, the value of which is used for displaying the object in 2 dimensions. This is unrelated to the 3-D coordinates for the object.

If a y2 attribute is present there MUST also be an x2 attribute.

9.5 X3

An atom MAY have an x3 attribute, the value of which is the x coordinate of a 3 dimensional object. The units are Angstrom and the axis system is always right handed.

If an x3 attribute is present there MUST also be y3 and z3 attributes present.

9.6 Y3

An atom MAY have a y3 attribute, the value of which is the y coordinate of a 3 dimensional object. The units are Angstrom and the axis system is always right handed.

If a y3 attribute is present there MUST also be x3 and z3 attributes present.

9.7 Z3

An atom MAY have a z3 attribute, the value of which is the z coordinate of a 3 dimensional object. The units are Angstrom and the axis system is always right handed.

If a z3 attribute is present there MUST also be x3 and y3 attributes present.
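
Sections 9.3 to 9.7 together require that 2-D coordinates and 3-D coordinates are each given either completely or not at all. A minimal Python sketch of that check follows; the function name coordinate_errors and the wording of the messages are hypothetical.

```python
import xml.etree.ElementTree as ET


def coordinate_errors(atom: ET.Element) -> list:
    """Return messages for incomplete 2-D or 3-D coordinate sets on an atom."""
    errors = []
    for group in (("x2", "y2"), ("x3", "y3", "z3")):
        present = [name for name in group if atom.get(name) is not None]
        # A group is acceptable when it is entirely absent or entirely present.
        if present and len(present) != len(group):
            missing = [name for name in group if name not in present]
            errors.append(f"{', '.join(present)} given without {', '.join(missing)}")
    return errors
```

The two groups are checked independently, reflecting the statement that the 2-D display coordinates are unrelated to the 3-D coordinates.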

9.8 FormalCharge

An atom MAY have a formalCharge attribute.

9.9 IsotopeNumber

An atom MAY have an isotopeNumber attribute.

9.10 Spin Multiplicity

An atom MAY have a spinMultiplicity attribute.

9.11 Count

An atom that is a descendant of a formula MAY have a count attribute. If it does not have a count attribute it is assumed to be present only once.

9.12 AtomParity

An atom that is not a descendant of a formula MAY have an atomParity element child.

9.13 Label

An atom MAY contain any number of label children.

9.14 Property

An atom MAY contain any number of property children.

10 AtomParity Element

An atomParity element MUST be the child of an atom. The atomParity defines the stereochemistry around an atom centre.

10.1 AtomRefs4

An atomParity MUST have an atomRefs4 attribute, the value of which MUST be the space separated ids of four different atoms which MUST be in the same overall parent molecule as the atomParity.

11 Bond Element

A bond element MUST be the child of a bondArray. In the molecular convention a bond MUST be between only two atoms and these atoms MUST (by definition) have the same molecule parent.

11.1 AtomRefs2

A bond MUST have an atomRefs2 attribute, the value of which MUST be the space separated ids of two different atoms which MUST be in the same molecule.
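
The atomRefs2 requirements can be checked against the atoms of the containing molecule. The Python sketch below is illustrative only: the function name bond_ref_errors and the message texts are hypothetical, and it assumes a molecule whose atoms and bonds sit directly in its own atomArray and bondArray.

```python
import xml.etree.ElementTree as ET

CML_NS = "http://www.xml-cml.org/schema"
NS = {"cml": CML_NS}


def bond_ref_errors(molecule: ET.Element) -> list:
    """Check each bond's atomRefs2 against the atoms of the same molecule."""
    atom_ids = {a.get("id") for a in molecule.findall("cml:atomArray/cml:atom", NS)}
    errors = []
    for bond in molecule.findall("cml:bondArray/cml:bond", NS):
        refs = bond.get("atomRefs2", "").split()
        if len(refs) != 2 or refs[0] == refs[1]:
            # Covers a missing attribute, a wrong number of ids and a self-bond.
            errors.append(f"bond {bond.get('id')}: need two different atom ids")
        elif not set(refs) <= atom_ids:
            errors.append(f"bond {bond.get('id')}: unknown atom id")
    return errors
```

An empty list indicates that every bond references two distinct atoms that are present in the molecule's atomArray.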

11.2 Order

A bond MUST have an order attribute. It is RECOMMENDED that the value of the order attribute is not numeric. If the value is other, the bond SHOULD have a dictRef attribute to add further information.

11.3 Id

It is RECOMMENDED that a bond has an id attribute so that it can be referenced. The id of a bond MUST be unique amongst the bonds of the eldest containing molecule.

The value of the id attribute MUST start with a letter, and MUST only contain letters, numbers, dot, hyphen or underscore.

IdStartChar ::= [A-Z] | [a-z]
IdChar ::= IdStartChar | [0-9] | "." | "-" | "_"
Id ::= IdStartChar (IdChar)*

11.4 BondStereo

A bond MAY have a bondStereo element child.

11.5 Label

A bond MAY have any number of label children.

12 BondStereo Element

The bondStereo element MUST be a child of a bond. bondStereo is a container used primarily to support cis (C) / trans (T) and wedge (W) / hatch (H) stereochemistry, but other forms may also be supported.

12.1 AtomRefs2

If the value of the bondStereo is W or H there MUST be an atomRefs2 attribute present, the value of which MUST be the space separated ids of the two atoms in the parent bond. The order of the ids is important; the first is the sharp end of the wedge or hatch and the second is the blunt end.

If an atomRefs2 attribute is present there MUST NOT be an atomRefs4 attribute present.

12.2 AtomRefs4

If the value of the bondStereo is C or T there MUST be an atomRefs4 attribute present, the value of which MUST be the space separated ids of four different atoms. Two of the ids in the atomRefs4 MUST be the ids of the atoms in the parent bond.

The atomRefs4 define a system: if cis this will be syn-periplanar, if trans this will be anti-periplanar. Typically the two central atoms will be bonded to each other (and the bondStereo element will be a child of this bond) with a bond order of D and the two terminal atoms will be bonded directly to these; however, this does not have to be the case.

If an atomRefs4 attribute is present there MUST NOT be an atomRefs2 attribute present.

12.3 DictRef

If the value of the bondStereo is other the element MUST have a dictRef attribute used to add further semantics.
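
The attribute rules in sections 12.1 to 12.3 can be summarised as a small check. The Python sketch below is a hedged illustration: the function name bond_stereo_errors and the messages are hypothetical, it assumes that the value of the bondStereo is the text content of the element, and it does not verify that the referenced ids belong to the parent bond.

```python
import xml.etree.ElementTree as ET


def bond_stereo_errors(stereo: ET.Element) -> list:
    """Check a bondStereo element against the attribute rules above."""
    # Assumption: the stereo value (C, T, W, H or other) is the element text.
    value = (stereo.text or "").strip()
    has_refs2 = stereo.get("atomRefs2") is not None
    has_refs4 = stereo.get("atomRefs4") is not None
    errors = []
    if has_refs2 and has_refs4:
        errors.append("atomRefs2 and atomRefs4 are mutually exclusive")
    if value in ("W", "H") and not has_refs2:
        errors.append("W or H requires atomRefs2")
    if value in ("C", "T"):
        refs = stereo.get("atomRefs4", "").split()
        if len(refs) != 4 or len(set(refs)) != 4:
            errors.append("C or T requires atomRefs4 with four different ids")
    if value == "other" and stereo.get("dictRef") is None:
        errors.append("other requires dictRef")
    return errors
```

A complete validator would also compare the referenced ids with the atomRefs2 of the parent bond, which needs access to the bond element and is left out here for brevity.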

A. References

[RFC2119]
IETF RFC 2119: Keywords for use in RFCs to Indicate Requirement Levels, S. Bradner, March 1997. Available at http://www.ietf.org/rfc/rfc2119.txt.
[XML]
Extensible Markup Language (XML) 1.0 (Fifth Edition), T. Bray, J. Paoli, C.M. Sperberg-McQueen, E. Maler and F. Yergeau, Editors. World Wide Web Consortium. 26 November 2008. This version is http://www.w3.org/TR/2008/REC-xml-20081126. The latest version of XML is available at http://www.w3.org/TR/REC-xml.

B. Acknowledgements


Creative Commons Licence
This work is licensed under a Creative Commons Attribution 3.0 Unported License.