Description
Summary
Support Extended XYZ officially.
Check whether the existing quip/gap/xyz format meets the specification. If it does, adding an alias is enough. If not, identify the differences, make the format compatible with the specification, and then add an alias.
Detailed Description
Below is the Extended XYZ specification.
Extended XYZ specification
General formatting
- Allowed characters: printable subset of ASCII, single byte
- Allowed whitespace: plain space and tab (no fancy unicode nonbreaking space, etc)
- Allowed end-of-line (EOL) characters: set by implementation + OS
- pure python: whatever is used to return lines by file object iterator
- low level c: fgets()
- Blank lines: allowed only as 2nd line of each frame (for plain xyz) and at end of file
General definitions
- regex: PCRE/python regular expression
- Whitespace: regex \s, i.e. space and tab
Primitive Data Types
String
Sequence of one or more allowed characters, optionally quoted, but must be quoted in some circumstances.
- Allowed characters - all except newline
- Entire string may be surrounded by double quotes, as first and last characters (must match).
Quotes inside string that are same as containing quotes must be escaped with backslash. Outermost
double quotes are not considered part of string value.
- Strings that contain any of the following characters must be quoted (not just backslash escaped)
- whitespace (regex \s)
- equals =
- double quote ", must be represented by \"
- comma ,
- open or close square bracket [ ] or curly brackets { }
- backslash, must be represented by double backslash \\
- newline, must be represented by \n
- Backslash \: only present in quoted strings, only used for escaping next character. All backslash
escaped characters are the following character itself except \n, which encodes a newline.
- Must conform to one of the following regex
- quoted string: (")(?:(?=(\\?))\2.)*?\1
- bare (unquoted) string: (?:[^\s=",}{\]\[\\]|(?:\\[\s=",}{\]\[\\]))+
- only used in comment line key-value pairs, not per-atom data
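As a rough illustration of the String rules above, the two regexes can be tried directly in Python. This is only a sketch: the helper names `unescape` and `parse_string` are hypothetical, and returning bare strings verbatim (without resolving escapes) is an assumption of this example.

```python
import re

# Regexes copied from the specification text above.
QUOTED_RE = re.compile(r'(")(?:(?=(\\?))\2.)*?\1')
BARE_RE = re.compile(r'(?:[^\s=",}{\]\[\\]|(?:\\[\s=",}{\]\[\\]))+')


def unescape(body):
    # Backslash-n encodes a newline; any other escaped character is itself.
    return re.sub(r'\\(.)', lambda m: '\n' if m.group(1) == 'n' else m.group(1), body)


def parse_string(token):
    # Hypothetical helper: return the string value of a whole token,
    # or None if the token is not a valid quoted or bare string.
    m = QUOTED_RE.match(token)
    if m and m.end() == len(token):
        # Outermost double quotes are not part of the value.
        return unescape(token[1:-1])
    if BARE_RE.fullmatch(token):
        # Assumption of this sketch: bare strings are returned verbatim.
        return token
    return None
```

For example, `parse_string('"hello world"')` gives `hello world`, while the unquoted token `a b` is rejected because it contains whitespace.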
Simple string
Sequence of one or more allowed characters, unquoted (so even outermost quotes are part of string), and without whitespace
- allowed characters - regex \S, i.e. all except newline and whitespace
- regex \S+
- only used in per-atom data, not comment line key-value pairs
Logical/boolean
- T or F or [tT]rue or [fF]alse or TRUE or FALSE
- regex
- true: (?:[tT]rue|TRUE|T)\b
- false: (?:[fF]alse|FALSE|F)\b
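A minimal sketch of how the two logical regexes could be applied to a whole token. The function name `parse_bool` and the choice to return `None` for non-logical tokens are assumptions of this example.

```python
import re

# Regexes copied from the specification text above.
TRUE_RE = re.compile(r'(?:[tT]rue|TRUE|T)\b')
FALSE_RE = re.compile(r'(?:[fF]alse|FALSE|F)\b')


def parse_bool(token):
    # Hypothetical helper: True/False for a logical token, None otherwise.
    if TRUE_RE.fullmatch(token):
        return True
    if FALSE_RE.fullmatch(token):
        return False
    return None
```

Mixed-case spellings such as `tRUE` are not accepted by either regex, so `parse_bool('tRUE')` returns `None`.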
Integer number
string of one or more decimal digits, optionally preceded by sign
- regex [+-]?+(?:0|[1-9][0-9]*)+\b
Floating point number
- optional leading sign [+-], decimal number including optional decimal point .,
optional [dDeE] followed by exponent consisting of optional sign followed by string of
one or more digits
- regex
- integer without leading sign bare_int = '(?:0|[1-9][0-9]*)'
- optional sign opt_sign = '[+-]?'
- floating number with decimal point float_dec = '(?:' + bare_int + '\.|\.)[0-9]*'
- exponent exp = '(?:[dDeE]'+opt_sign+'[0-9]+)?'
- end of number num_end = '(?:\b|(?=\W)|$)'
- combined float regexp opt_sign + '(?:' + float_dec + exp + '|' + bare_int + exp + '|' + bare_int + ')' + num_end
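The pieces above can be assembled in Python roughly as follows. The variable names mirror the specification; `FLOAT_RE` and `parse_float` are hypothetical names, and converting Fortran-style `d`/`D` exponents by replacing them with `e` is an assumption of this sketch.

```python
import re

# Building blocks, named as in the specification text.
bare_int = r'(?:0|[1-9][0-9]*)'
opt_sign = r'[+-]?'
float_dec = r'(?:' + bare_int + r'\.|\.)[0-9]*'
exp = r'(?:[dDeE]' + opt_sign + r'[0-9]+)?'
num_end = r'(?:\b|(?=\W)|$)'
FLOAT_RE = re.compile(
    opt_sign + r'(?:' + float_dec + exp + r'|' + bare_int + exp + r'|' + bare_int + r')' + num_end
)


def parse_float(token):
    # Hypothetical helper: return a Python float, or None if the token does not qualify.
    if FLOAT_RE.fullmatch(token) is None:
        return None
    try:
        # Python's float() does not understand d/D exponents, so map them to e.
        return float(token.replace('d', 'e').replace('D', 'e'))
    except ValueError:
        # A lone "." passes the regex but is not convertible.
        return None
```

Note that a plain integer such as `42` also matches the float regex, which is why the type identification order matters.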
Order for identifying primitive data types, accept first one that matches
- int
- float
- bool
- bare string (containing no whitespace or special characters)
- quoted string (starting and ending with double quote and containing only allowed characters)
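The identification order could be sketched as a chain of checks that returns on the first match. Everything here is illustrative: `identify` is a hypothetical name, and the integer check uses `[+-]?[0-9]+` because the possessive `?+` in the specification's regex needs Python 3.11 or later. To my reading the two accept the same whole tokens, but treat that as an assumption.

```python
import re

INT_RE = re.compile(r'[+-]?[0-9]+')  # assumed whole-token equivalent of the spec's regex
bare_int = r'(?:0|[1-9][0-9]*)'
exp = r'(?:[dDeE][+-]?[0-9]+)?'
FLOAT_RE = re.compile(
    r'[+-]?(?:(?:' + bare_int + r'\.|\.)[0-9]*' + exp
    + r'|' + bare_int + exp + r'|' + bare_int + r')(?:\b|(?=\W)|$)'
)
TRUE_RE = re.compile(r'(?:[tT]rue|TRUE|T)\b')
FALSE_RE = re.compile(r'(?:[fF]alse|FALSE|F)\b')
BARE_RE = re.compile(r'(?:[^\s=",}{\]\[\\]|(?:\\[\s=",}{\]\[\\]))+')
QUOTED_RE = re.compile(r'(")(?:(?=(\\?))\2.)*?\1')


def identify(token):
    # Hypothetical helper: return (type_name, value), trying types in the listed order.
    if INT_RE.fullmatch(token):
        return 'int', int(token)
    if FLOAT_RE.fullmatch(token):
        try:
            return 'float', float(re.sub('[dD]', 'e', token))
        except ValueError:
            pass  # matched the regex but not convertible, e.g. "."; fall through
    if TRUE_RE.fullmatch(token):
        return 'bool', True
    if FALSE_RE.fullmatch(token):
        return 'bool', False
    if BARE_RE.fullmatch(token):
        return 'str', token
    m = QUOTED_RE.match(token)
    if m and m.end() == len(token):
        body = token[1:-1]
        # Backslash-n encodes a newline; any other escaped character is itself.
        return 'str', re.sub(r'\\(.)', lambda e: '\n' if e.group(1) == 'n' else e.group(1), body)
    raise ValueError('not a valid primitive value: %r' % token)
```

With this ordering, `T` is read as a logical while `Tetra` falls through to a bare string, and `42` is an integer even though it would also satisfy the float regex.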
one dimensional array (vector)
sequence of one or more of the same primitive type
- new style: opens with [, one or more of the same primitive type separated by commas and optional whitespace, ends with ]
- backward compatible: opens with " or {, one or more of the same primitive types (all types allowed in {}, all except string in "")
separated by whitespace, ends with matching " or }. For backward compatibility, a single element backward
compatible array is interpreted as a scalar of the same type.
- primitive data type is determined by same priority as single primitive item, but must be satisfied
by entire list simultaneously. E.g. all integers will result in an integer array, but a mix
of integer and float will result in a float array, and a mix of integer and valid strings will
result in a string array.
two dimensional array (matrix)
sequence of one or more new style one dimensional arrays of the same length and type
- opens with [, one or more new style one dimensional arrays separated by commas, ends with ]
- all contained one dimensional arrays in a single two dimensional array must have the same number of elements and the same
primitive data type, and will be promoted to other possible types if necessary to parse entire
array. E.g. a row of integers followed by a row of strings will be promoted to a 2-d string array.
XYZ file
A concatenation of 1 or more FRAMES (below), with optional blank lines at the end (but not between frames)
FRAME
- Line 1: a single integer <N> preceded and followed by optional whitespace
- Line 2: zero or more per-config key=value pairs (see key-value pairs below)
- Lines 3..N+2: per-atom data lines with M columns each (see Properties and Per-Atom Data below)
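The frame layout above suggests a simple reader loop. The following is a minimal sketch, not a reference implementation: `iter_frames` is a hypothetical name, the comment and per-atom lines are returned unparsed, and stripping `\r\n` is an assumption about the EOL convention.

```python
import io


def iter_frames(fileobj):
    # Hypothetical helper: yield (natoms, comment_line, atom_lines) per frame.
    lines = iter(fileobj)
    for first in lines:
        if first.strip() == '':
            # Blank lines are allowed only at end of file, so the rest must be blank too.
            if any(rest.strip() for rest in lines):
                raise ValueError('blank line between frames')
            return
        natoms = int(first.strip())  # Line 1: a single integer N
        comment = next(lines, None)  # Line 2: key=value pairs, possibly blank
        if comment is None:
            raise ValueError('missing comment line')
        atoms = []
        for _ in range(natoms):  # Lines 3..N+2: per-atom data
            line = next(lines, None)
            if line is None:
                raise ValueError('frame truncated')
            atoms.append(line.rstrip('\r\n'))
        yield natoms, comment.rstrip('\r\n'), atoms


# Usage with an in-memory file holding two frames and a trailing blank line.
example = io.StringIO(
    '2\nProperties=species:S:1:pos:R:3\nH 0 0 0\nH 0 0 0.74\n'
    '1\n\nHe 0 0 0\n\n'
)
frames = list(iter_frames(example))
```

The second frame in the usage example has a blank comment line, which is the plain xyz case.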
key=value pairs on second ("comment") line
Associates per-configuration value with key. Spaces are allowed around = sign, which do not become part of the key or value.
Key: bare or quoted string
Value: primitive type, 1-D array, or 2-D array. Type is determined from context according to order specified above.
Special key "Properties": defines the columns in the subsequent lines in the frame.
- Value is a string with the format of a series of triplets, separated by ":", each triplet having the format: "<name>:<T>:<m>".
- The <name> (string) names the column(s), <T> is one of "S", "I", "R", "L", and indicates the type in the column: "string", "integer", "real", "logical", respectively. <m> is an integer > 0 specifying how many consecutive columns are being referred to.
- The sum of the counts "m" must equal the number of per-atom columns M (as defined in FRAME)
- If after full parsing the key "Properties" is missing, the format is retroactively assumed to be plain xyz (4 columns, Z/species x y z), the entire second line is stored as a per-config "comment" property, and columns beyond the 4th are not read.
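A small sketch of how a "Properties" value might be split into triplets. `parse_properties` is a hypothetical helper, and it assumes that property names themselves contain no ":".

```python
def parse_properties(value):
    # Hypothetical helper: split 'name:T:m:name:T:m...' into (name, type, count) tuples.
    fields = value.split(':')
    if len(fields) % 3 != 0:
        raise ValueError('Properties must be a series of name:T:m triplets')
    props = []
    for i in range(0, len(fields), 3):
        name, type_letter, count = fields[i:i + 3]
        if type_letter not in ('S', 'I', 'R', 'L'):
            raise ValueError('unknown type letter: %r' % type_letter)
        m = int(count)
        if m <= 0:
            raise ValueError('column count must be > 0')
        props.append((name, type_letter, m))
    return props


# Usage: the sum of the counts gives the number of per-atom columns M.
props = parse_properties('species:S:1:pos:R:3')
ncols = sum(m for _, _, m in props)
```

Here `props` is `[('species', 'S', 1), ('pos', 'R', 3)]` and `ncols` is 4.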
Per-atom data lines
Each line contains a sequence of primitive-type values (with string replaced by simple string), separated by one or more whitespace characters and ending with EOL (optional for the last line). The total number of columns in each row must equal M, which is also the sum of the counts "m" in the "Properties" value string.
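To make the column rule concrete, here is a hedged sketch that converts one per-atom line using a list of `(name, type, count)` triplets. The names `parse_atom_line`, `to_logical`, and `CONVERTERS` are hypothetical, and collapsing single-column properties to a scalar is a design choice of this example, not something the specification requires.

```python
import re


def to_logical(token):
    # Spellings accepted by the logical regexes in the specification.
    if token in ('T', 'true', 'True', 'TRUE'):
        return True
    if token in ('F', 'false', 'False', 'FALSE'):
        return False
    raise ValueError('not a logical value: %r' % token)


CONVERTERS = {
    'S': str,                                          # simple string, kept verbatim
    'I': int,
    'R': lambda t: float(re.sub('[dD]', 'e', t)),      # allow d/D exponents
    'L': to_logical,
}


def parse_atom_line(line, props):
    # Simple strings are regex \S+, so tokens are the maximal non-whitespace runs.
    tokens = re.findall(r'\S+', line)
    ncols = sum(m for _, _, m in props)
    if len(tokens) != ncols:
        raise ValueError('expected %d columns, found %d' % (ncols, len(tokens)))
    row, pos = {}, 0
    for name, type_letter, m in props:
        values = [CONVERTERS[type_letter](t) for t in tokens[pos:pos + m]]
        row[name] = values[0] if m == 1 else values
        pos += m
    return row


# Usage with a hand-written property list (M = 5 columns).
props = [('species', 'S', 1), ('pos', 'R', 3), ('fixed', 'L', 1)]
row = parse_atom_line('H 0.0 0.0 0.74 T', props)
```

A line with the wrong number of columns raises `ValueError`, matching the requirement that every row has exactly M columns.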
Further Information, Files, and Links
No response