Clover's DataRecord metadata

Clover metadata describes semantics of data record, this is how to create data record from elementary data types.

Representation of data within CloverETL

CloverETL works with data in terms of data records and data fields within records. Internally, all records are represented as variable length data. It means, that every data field consumes only as much memory as it is needed for storing field's value. If you have field of type STRING specified to be of 50 chars in length and this field is populated with string of 20 characters, only 20 characters are allocated in memory.

Moreover, CloverETL doesn't insist on any length to be specified. There is of course internal length maximum for any field, but it should be enough to accomodate even very long strings. We speak about strings because for other types there is fix size of the field regardless of the actual value.

Despite the information just given, there are some cases when it matters whether you specify the size of each field or not. This will be discussed in following text.

What types of data fields CloverETL supports ? Following table gives list of all supported types of data (so far) together with ranges of values for each type:

Overview of supported data types

Data type name Based on Size Range of values
string java.lang.String depends on actual data length
date java.util.Date 64bit - sizeof(long) starts: January 1, 1970, 00:00:00 GMT
increment: 1ms
integer java.lang.Integer 32bit - sizeof(int) min: -231
max: 231-1.
number java.lang.Double 64bit - sizeof(double) min: 2-1074
max: (2-2-52)·21023
numeric deprecated. The same as number 64bit - sizeof(double) min: 2-1074
max: (2-2-52)·21023
long lava.lang.Long 64bit – size of (long) min: -263
max: 263-1
decimal Clover's own implementation - similar to java.math.BigDecimal depends on “precision” parameter - maximum digits this datatype can contain (length) and maximum decimal digits (scale) depends on length and scale parameters
eg. decimal(6,2) can represent min: -9999.99 max: 9999.99
byte java.lang.Byte depends on actual data length min: -128
max: 127
cbyte
compressed array of bytes
java.lang.Byte depends on actual data length and success of compression min: -128
max: 127
boolean java.lang.Boolean represents one bit of information, but its “size” isn't something that's precisely defined true/false (1/0)

Specification of record format

Java style

One way of putting together description of record format is to create some Java code and use CloverETL classes/methods calls.
This snippet of source code shows how to do it:

  DataRecordMetadata metadata;
 
  metadata=new DataRecordMetadata("TestMetadata",DataRecordMetadata.DELIMITED_RECORD);
 
  metadata.addField(new DataFieldMetadata("Name",DataFieldMetadata.STRING_FIELD, ";"));
  metadata.addField(new DataFieldMetadata("Age",DataFieldMetadata.NUMERIC_FIELD, "|"));
  metadata.addField(new DataFieldMetadata("City",DataFieldMetadata.STRING_FIELD, "\n"));
  metadata.addField(new DataFieldMetadata("Born",DataFieldMetadata.DATE_FIELD, "\n"));
  metadata.addField(new DataFieldMetadata("Value",DataFieldMetadata.INTEGER_FIELD, "\n"));

Default field delimiter and record delimiter should be used since 2.4 version. Data readers better recognizes them then field delimiters. So metadata definition should be as follows:

  DataRecordMetadata metadata;
 
  metadata=new DataRecordMetadata("TestMetadata",DataRecordMetadata.DELIMITED_RECORD);
 
  metadata.addField(new DataFieldMetadata("Name",DataFieldMetadata.STRING_FIELD, null));
  metadata.addField(new DataFieldMetadata("Age",DataFieldMetadata.NUMERIC_FIELD, null));
  metadata.addField(new DataFieldMetadata("City",DataFieldMetadata.STRING_FIELD, "|"));
  metadata.addField(new DataFieldMetadata("Born",DataFieldMetadata.DATE_FIELD, null));
  metadata.addField(new DataFieldMetadata("Value",DataFieldMetadata.INTEGER_FIELD, null));
 
  metadata.setFieldDelimiter(";");
  metadata.setRecordDelimiters("\n");

In above metadata fields Name, Age and Born have ; as delimiter, field City - |, and last field (Value) is delimited from the next field (record) by \n.

XML serialized metadata

The easier way is to create a XML description of record format which can be read by CloverETL and materialized in memory automatically.
This “XML serialization” of metadata can be stored either in file on disk or in String variable in memory.
It is customary to use ”.fmt” extension for XML file containing XML serialized metadata of data record.
Following example shows simple XML serialized metadata describing record containing three data fields:

  <?xml version="1.0" encoding="UTF-8"?>
  <Record name="TestInput" type="delimited">
    <Field name="Name" type="string" delimiter=";"/>
    <Field name="Age" type="number" delimiter="|"/>
    <Field name="City" type="string" delimiter="\n"/>
  </Record>

The same as above for xml metadata definition:

  <?xml version="1.0" encoding="UTF-8"?>
  <Record name="TestInput" type="delimited" fieldDelimiter=";" recordDelimiter="\n">
    <Field name="Name" type="string"/>
    <Field name="Age" type="number" delimiter="|"/>
    <Field name="City" type="string"/>
  </Record>

Example above shows definition of data record named “TestInput” specified as delimited - this is some additional info used by CloverETL components.
The record has three fields:

  • Name (of type string)
  • Age (of type number)
  • City (of type string)

The XML representation also specifies encoding of characters - UTF-8 - it is imperative that when creating, you really save the file using the encoding specified. Otherwise XML parser used by CloverETL won't be able correctly interpret the file.

De-serializing XML metadata to Java object

When needed, metadata serialized in XML form can be easily read in and appropriate Java object representing metadata can be created automatically by CloverETL. Example deserializing metadata from XML

  DataRecordMetadata metadata;
  DataRecordMetadataXMLReaderWriter reader=new DataRecordMetadataXMLReaderWriter();
 
  try{
     metadata=reader.read(new FileInputStream("metadata.fmt"));
  }catch(IOException ex){
     System.err.println("Error when reading metadata!!");
  }

Record/Field Naming conventions

There is a strict rule for naming Data Fields (and Data Records).
This rule specifies that name can be composed of letters [a-zA-Z], numbers [0-9] (not at the first place) and [_] (underscore).

Example of valid/invalid names:

  • A_name123 is a valid name
  • *%1 is an invalid name

Delimiters

Each field in above given example has specified a delimiter (consists of one or more characters) which separates such filed from following one in text data file. This information is used by data parser when parsing data records (of this structure) from external text files. The same delimiters are used on the other hand when CloverETL outputs internal data records (of this structure) into output text files.
Delimiters can be of any length (actually up to 32 chars) and each field can have different one. Following control characters are supported:

  • \t (tabulator)
  • \n (line feed)
  • \r (carriage return)
  • \f (form feed)

Universal Data Reader can handle more (alternative) delimiters for one data field; in such case delimiters have to be delimited by \\|. In following example:

<Record fieldDelimiter=":\\|;" name="recordName1" previewAttachmentCharset="ISO-8859-1" recordDelimiter="\n\\|\r\n" type="delimited">
<Field name="field1" type="string"/>
<Field delimiter=":\\|;\\||" name="field2" type="integer"/>
.
.
.
<Field name="field3" type="number"/>
</Record>
records are delimited by Linux EOL (\n) or Windows EOL (\r\n) and fields are delimited by colon (:) or semicolon (;), field2 can be delimited by pipe (|) also.

Three types of delimiters:

  • fieldDelimiter - character(s) separating adjacent fields of each record (during parsing or formatting). This delimiter does not follow the last field of record.
  • recordDelimiter - character(s) marking the end of the whole record. It follows the last field of each record (during parsing or formatting).
  • delimiter - character(s) following specific field of each record. It can be seen as the attribute of the field in the source code. It replaces the original fieldDelimiter (if defined). If it follows the last field of record, it precedes the recordDelimiter (if defined).

Field formats and other goodies

Every field can have additional parameters specified.
These are:

  • delimiter - character(s) marking end of field's value in text representation of field (during parsing or formatting). It replaces the default fieldDelimiter (if any is defined) except the recordDelimiter. If it is defined for the last field, it follows the last field and precedes the recordDelimiter (if defined).
  • size - max length (in characters) of field's text representation when parsing from text or formatting into text
  • shift - offset (in characters) of field's beginning from previous field (can be positive or negative)
  • nullable - yes/no - can contain NULL value
  • format - a format string specifying in which form data is expected to appear in text form - see format paragraph.
  • locale - a locale which should be used when parsing/formatting field's value from/to text representation
  • auto_filling - defines name of function applied for automatic field filling by some readers
  • default - what is field's default value
  • trim - indicates if to trim input string before converting to value or not. Implicitly input strings for numeric and date fields are trimmed for other types aren't.
  • skipSourceRows - number of records skipped from the beginning of each single file when reading multiple files (wild cards - $ or * - are used in File URL). In such a case, multiple files satisfying the name pattern are read.

Following example shows additional parameters “in action”

  <?xml version="1.0" encoding="UTF-8"?>
  <!--Automatically generated from database null -->
  <Record name="EMPLOYEE" type="delimited">
    <Field name="EMP_NO" type="integer" delimiter="," format="#"/>
    <Field name="FIRST_NAME" type="string" delimiter="," />
    <Field name="LAST_NAME" type="string" delimiter="," />
    <Field name="PHONE_EXT" type="string" nullable="yes" delimiter=","/>
    <Field name="HIRE_DATE" type="date" delimiter="," format="dd/MM/yyyy" />
    <Field name="BIRTH_DATE" type="date" delimiter="," locale="en"/>
    <Field name="DEPT_NO" type="string" delimiter="," />
    <Field name="JOB_CODE" type="string" delimiter="," />
    <Field name="JOB_GRADE" type="numeric" delimiter="," format="#" />
    <Field name="JOB_COUNTRY" type="string" delimiter="," trim="true"/>
    <Field name="SALARY" type="numeric" delimiter="," />
    <Field name="FULL_NAME" type="string" nullable="yes" delimiter="\n" />
  </Record>

delimiter

Delimiter is used when data records are parsed from text representation of data (e.g. CSV). Each field can have its own delimiter consisting of one or more characters. Special characters (e.g. CR, LF,..) has their own text representations corresponding those from C or Java languages.
Fields can also have variant delimiters which allows for defining that field's end in data can be marked by one of defined set of characters. Variant delimiters are defined using \\| (double backslash & pipe) characters.
See following example which illustrates usage:

  <Record name="EMPLOYEE" type="delimited">
    <Field name="EMP_NO" type="integer" delimiter="$$" />
    <Field name="FIRST_NAME" type="string" delimiter="$$\\|##" />
    <Field name="LAST_NAME" type="string" delimiter="abc" />
  </Record>

Above example shows EMP_NO field with $$ as a field delimiter, FIRST_NAME which can be delimited either by $$ or ## and LAST_NAME delimited by abc string.

Note: delimiters are consider ONLY when data are parsed from text representation or formatted into text representation and are used inside Record which has type defined as delimited or mixed.

size

Size defines expected length (in characters) of field's text representation. It is used when parsing fix-length data is needed. It is also used when field's value is formatted into text representation when writing data out. If field's value is shorter than defined size, space character is used as padding (FixLenDataWriter component allows this to be user-defined). If text representation of field is longer then defined size, then text is truncated.

Example:

  <Record name="EMPLOYEE" type="fixed">
    <Field name="EMP_NO" type="integer" size="5" />
    <Field name="FIRST_NAME" type="string" size="15"/>
    <Field name="LAST_NAME" type="string" size="30" />
  </Record>

shift

Shift can be used to instruct parsing component that text representation of particular field does not start immediately after preceding field (or preceding record for 1st field) but at some offset (which can be both positive or negative). It allows, for example, to skip data which we are not interested in parsing or parse certain section of input data twice each time as different data type (negative shift can be used for this).

nullable

As you can see, some fields (PHONE_EXT for example) have attribute nullable set to “yes”. It basically means that for this field, it is allowed to contain null value. The default is yes/true (field can contain null) ! The exact behaviour is influenced by concrete data parser or data formatter, but simply put, when field is not specified to be nullable and application tries to put null value in it, this operation fails (which can result in stopping the whole transformation process).

format

Format attribute can be used for specifying expected format of data when parsing in or printing out of CloverETL. In this case, HIRE_DATE field is of type date and is specified, that date values in external textual data will look like this: 19/12/1999.
For all possible format specifiers (control characters), see documentation for “java.text.SimpleDateFormat”.
Similar to HIRE_DATE is JOB_GRADE field, which is of type numeric. Here the format specifies, that data is expected to be integer numbers only (no decimal point allowed).
Format is used for mapping types between CloverETL and database also. Three database types: date, time and timestamp corresponds with one Clover DateDataField type. Manner of loading data to/from database is chosen depending on the format string: if it contains date and time symbols, value is treated as timestamp, if it contains only date symbols, it is treated as date, if it contains only time symbols, it is treated as time, if format is not specified, it is treated as timestamp to avoid any information losing. When loading ByteDataField to/from database you can choose between two modes: binary (default) and blob. To use blob interface write blob in format (any other string or nothing will be treated as binary mode).
See following tables for date and number format specifiers (courtesy of SUN Corporation):

Date

Letter Date or Time Component Presentation Examples
“G” Era designator Text “AD”
“y” Year Year “1996”; “96”
“M” Month in year Month “July”; “Jul”; “07”
“w” Week in year Number “27”
“W” Week in month Number “2”
“D” Day in year Number “189”
“d” Day in month Number “10”
“F” Day of week in month Number “2”
“E” Day in week Text “Tuesday”; “Tue”
“a” Am/pm marker Text “PM”
“H” Hour in day (0-23) Number “0”
“k” Hour in day (1-24) Number “24”
“K” Hour in am/pm (0-11) Number “0”
“h” Hour in am/pm (1-12) Number “12”
“m” Minute in hour Number “30”
“s” Second in minute Number “55”
“S” Millisecond Number “978”
“z” Time zone General time zone “Pacific Standard Time”; “PST”; “GMT-08:00”
“Z” Time zone RFC 822 time zone ”-0800”

Examples of data formats and results

Date and Time Pattern Result
“yyyy.MM.dd G 'at' HH:mm:ss z” “2001.07.04 AD at 12:08:56 PDT”
“EEE, MMM d, yy” | “Wed, Jul 4, '01” | | “h:mm a” | “12:08 PM” | | “hh 'oclock' a, zzzz” “12 o'clock PM, Pacific Daylight Time”
“K:mm a, z” “0:08 PM, PDT”
“yyyyy.MMMMM.dd GGG hh:mm aaa” “02001.July.04 AD 12:08 PM”
“EEE, d MMM yyyy HH:mm:ss Z” “Wed, 4 Jul 2001 12:08:56 -0700”
“yyMMddHHmmssZ” “010704120856-0700”

Number

Number formatting specifiers

Symbol Location Localized? Meaning
“0” Number Yes Digit
”#” Number Yes Digit, zero shows as absent
”.” Number Yes Decimal separator or monetary decimal separator
”-” Number Yes Minus sign
”,” Number Yes Grouping separator
“E” Number Yes Separates mantissa and exponent in scientific notation. Need not be quoted in prefix or suffix.
”;” Subpattern boundary Yes Separates positive and negative subpatterns
”%” Prefix or suffix Yes Multiply by 100 and show as percentage
“\u2030” Prefix or suffix Yes Multiply by 1000 and show as per mille
“�” (“\u00A4”) Prefix or suffix No Currency sign, replaced by currency symbol. If doubled, replaced by international currency symbol.
If present in a pattern, the monetary decimal separator is used instead of the decimal separator.
”'” Prefix or suffix No Used to quote special characters in a prefix or suffix, for example, ”'#'#” formats 123 to ””#123””.
To create a single quote itself, use two in a row: ”# o''clock”
Number format

When specifying format for numbers, Clover(Java) uses default system locale setting (unless other locale is specified through locale option).
This is important in cases when you are parsing data where decimal numbers use ””,”” (comma) as decimal separator whereas system default (national) says it is ”.” (point).
In such case, use locale option together with format option to change expected decimal delimiter. Example of using local &amp; format parameters

  <Field name="Freight" type="numeric" delimiter="|" format="#.#" locale="en.US"/>

locale

Instead of specifying format parameter (or together with format), you may specify a locale parameter – it states which geographical, political, or cultural region you want your information to be/is formatted for. Thus instead of specifying format for date field, specify Germany locale (eg. locale=“de“), for instance. Clover will automatically choose proper date format used in Germany. There are cases when both format and locale parameters have their sense – for example when specifying format of decimal numbers. You define format/pattern with decimal separator and locale specifies, whether the separator is a comma or dot.

auto_filling

Clover field marked as auto filling will be filled from a function that is specified in value of auto_filling attribute via this metadata attribute (ie: auto_filling=“global_row_count”). If a field has an auto_filling attribute defined, input parser skips the field and continues to next fields. All these fields are ignored by parsers.

Not all input components respect this attribute. The list of supported readers are following:

  • DelimitedDataReader
  • FixlenDataReader
  • UniversalDataReader
  • DBFDataReader
  • CloverDataReader
  • XLSDataReader
  • XMLExtract
  • XMLXPathReader
  • MultiLevelReader
  • DBInputTable
  • JMSReader

Following two components can use ErrCode and ErrText auto-filling functions:

  • DBOutputTable
  • DBExecute

And available functions:

  • default_value - if a reader does'nt read any value, the reader sets default value to input record
  • source_name - name of input source
  • global_row_count - start value for all count function is 0, this function counts every record for all sources. If the first input source has 100 records and the next has 50 records, the field will have values from 0 to 149. It couns rows for one reader.
  • source_row_count - the function counts every record for particular source separately. From the example above, the field will have values from 0 to 99 and then the counter will be started from 0 to 49. It couns rows for one reader.
  • metadata_row_count - the function counts every record for all sources but contrary to global_row_count, the value is counted for every metadata (data type) separately (there is supposed that readers can produce more than one type of data - for instance XML Extract - for each data type has separated counter). It couns rows for one reader.
  • metadata_source_row_count - the function counts every record for particular source and metadata (data type) separately. It couns rows for one reader.
  • source_timestamp - the function gets time stamp from a file. The time stamp is possible to get only from local files, not from compressed files, ftp or http connections.
  • source_size - the function gets file size from a file. The file size is possible to get only from local files, not from compressed files, ftp or http connections.
  • row_timestamp - the function creates time stamp for every row.
  • reader_timestamp - the function creates time stamp for a reader component.
  • sheet_name - name of the sheet of the input file.
  • ERR_CODE, ERR_TEXT - these functions are available only for some components, writes error code and error text if a record fails.


Following example shows usual definition of functions in metadata

  <Field delimiter=";" name="SourceName" type="string" auto_filling="source_name"/>
  <Field delimiter=";" name="SourceTimestamp" type="date" auto_filling="source_timestamp"/>
  <Field delimiter=";" name="GlobalRowCount" type="integer" auto_filling="global_row_count"/>
  <Field delimiter=";" name="SourceRowCount" type="integer" auto_filling="source_row_count"/>
  <Field delimiter=";" name="MetadataRowCount" type="integer" auto_filling="metadata_row_count"/>
  <Field delimiter=";" name="MetadataSourceRowCount" type="integer"
                       auto_filling="metadata_source_row_count"/>

default

CloverETL offers for each field default value to be specified. This value is used (in certain cases) when field is assigned to be NULL, but null value is not allowed for such field. It contrasts a little bit with what was stated before, but only on a first sight.
When field is assigned a default value, it is fist parsed from specified textual representation. If the text is not parseable - like if you specify default=“xyz” for numeric field (e.g. integer) - the assignment of default finishes with parse exception.
Also if you specify default value for date field, make sure it matches specified (or system default) date format - it holds in generally for all data types.

Following example shows fields with specified default values

  <?xml version="1.0" encoding="UTF-8"?>
  <Record name="Orders" type="delimited">
    <Field name="OrderID" type="numeric" delimiter="|" format="#" />
    <Field name="OrderDate" type="date"
           delimiter="|" format="dd.MM.yyyy"
           default="01.01.1900" nullable="no"/>
    <Field name="Amount" type="number" delimiter="\n" default="0.0" nullable="no" />
  </Record>

In this example, “OrderDate” is defaulted to 1.1.1900 in case it is not present in text data which this record is parsed from. In general, when this field is assigned null value, this specified default value is assigned instead. The same holds for “Amount” field, except the default is specified to be 0.
Said that, there is one more important note: this behaviour is not default and concerns only data parsers. If you, in your code, attempt to assign null value into not-nullable field, the “BadDataFormatException” will be raised.
If you use any of clover's data parsers, you may specify “DataPolicy” which states what should happen if parsed value can't be assigned to data field (as in case when value is null and field is not-nullable).
There are three different data policies defined:

  • “strict” - any BadDataFormatException aborts processing of graph. This is default value for specific readers.
  • “controlled” - every BadDataFormatException is only logged for entire record while processing continues for next record.
  • “lenient” - every BadDataFormatException is skipped while processing continues for next record.

Default value is also used when (in Java code) you call “setToDefault()” method on either DataRecord or DataField. In such case, Clover attempts to set/parse field's value from specified default.
Also when “reset()” method is called on DataRecord or DataField and field has defined default value, then the field is set to that default value.

Consult documentation regarding parser components for additional details !


Dynamic Metadata

In addition to all other metadata created or extracted using GUI, you can also write metadata definition in the Source tab of the Graph Editor pane. Unlike the metadata defined in GUI, such metadata written in the Source tab cannot be edited in GUI. To define the metadata in the Source tab, open this tab and write there the following: <Metadata id=“YourMetadataId” connection=“YourConnectionToDB” sqlQuery=“YourQuery”/> Select any expression for YourMetadataId, type your DB connection that should be used to connect to DB as YourConnectionToDB and type the query that will be used to extract data from DB as YourQuery. If you want to speed the run of your graph, you can also add to your query “where 1=0” or “and where 1=0” (the last expression should be added to the query terminated by other “where …” expression. This way only metadata will be extracted and no data will be read. Remember that such metadata cannot be created in GUI and will only be generated at the runtime.

DataRecord XML DTD

DTD of DataRecord metadata XML representation. Record metadata DTD:

    <!ELEMENT Record (FIELD+)>
    <!ATTLIST Record
               name ID #REQUIRED
               id ID #REQUIRED
               name CDATA #REQUIRED
               type NMTOKEN (delimited | fixed | mixed) #REQUIRED
               locale CDATA #IMPLIED
               recordDelimiter CDATA #IMPLIED
 
 
    <!ELEMENT Field (#PCDATA) EMPTY>
    <!ATTLIST Field
               name ID #REQUIRED
               type NMTOKEN #REQUIRED
               delimiter NMTOKEN #IMPLIED ","
               size NMTOKEN #IMPLIED "0"
               format CDATA #IMPLIED
               locale CDATA #IMPLIED
               nullable NMTOKEN (true | false) #IMPLIED "true"
               auto_filling NMTOKEN #IMPLIED
               default CDATA #IMPLIED


Metadata definition dialog:

graph_elements/metadata.txt · Last modified: 2010/06/10 17:21 by twaller
Back to top
chimeric.de = chi`s home Valid CSS Driven by DokuWiki do yourself a favour and use a real browser - get firefox!! Recent changes RSS feed Valid XHTML 1.0