Clover metadata describes semantics of data record, this is how to create data record from elementary data types.
CloverETL works with data in terms of data records and data fields within records. Internally, all records are represented as variable length data. It means, that every data field consumes only as much memory as it is needed for storing field's value. If you have field of type STRING specified to be of 50 chars in length and this field is populated with string of 20 characters, only 20 characters are allocated in memory.
Moreover, CloverETL doesn't insist on any length to be specified. There is of course internal length maximum for any field, but it should be enough to accomodate even very long strings. We speak about strings because for other types there is fix size of the field regardless of the actual value.
Despite the information just given, there are some cases when it matters whether you specify the size of each field or not. This will be discussed in following text.
What types of data fields CloverETL supports ? Following table gives list of all supported types of data (so far) together with ranges of values for each type:
Overview of supported data types
| Data type name | Based on | Size | Range of values |
|---|---|---|---|
| string | java.lang.String | depends on actual data length | |
| date | java.util.Date | 64bit - sizeof(long) | starts: January 1, 1970, 00:00:00 GMT increment: 1ms |
| integer | java.lang.Integer | 32bit - sizeof(int) | min: -231 max: 231-1. |
| number | java.lang.Double | 64bit - sizeof(double) | min: 2-1074 max: (2-2-52)·21023 |
| numeric | deprecated. The same as number | 64bit - sizeof(double) | min: 2-1074 max: (2-2-52)·21023 |
| long | lava.lang.Long | 64bit – size of (long) | min: -263 max: 263-1 |
| decimal | Clover's own implementation - similar to java.math.BigDecimal | depends on “precision” parameter - maximum digits this datatype can contain (length) and maximum decimal digits (scale) | depends on length and scale parameters eg. decimal(6,2) can represent min: -9999.99 max: 9999.99 |
| byte | java.lang.Byte | depends on actual data length | min: -128 max: 127 |
| cbyte compressed array of bytes | java.lang.Byte | depends on actual data length and success of compression | min: -128 max: 127 |
| boolean | java.lang.Boolean | represents one bit of information, but its “size” isn't something that's precisely defined | true/false (1/0) |
One way of putting together description of record format is to create some Java code and use CloverETL classes/methods calls.
This snippet of source code shows how to do it:
DataRecordMetadata metadata; metadata=new DataRecordMetadata("TestMetadata",DataRecordMetadata.DELIMITED_RECORD); metadata.addField(new DataFieldMetadata("Name",DataFieldMetadata.STRING_FIELD, ";")); metadata.addField(new DataFieldMetadata("Age",DataFieldMetadata.NUMERIC_FIELD, "|")); metadata.addField(new DataFieldMetadata("City",DataFieldMetadata.STRING_FIELD, "\n")); metadata.addField(new DataFieldMetadata("Born",DataFieldMetadata.DATE_FIELD, "\n")); metadata.addField(new DataFieldMetadata("Value",DataFieldMetadata.INTEGER_FIELD, "\n"));
Default field delimiter and record delimiter should be used since 2.4 version. Data readers better recognizes them then field delimiters. So metadata definition should be as follows:
DataRecordMetadata metadata; metadata=new DataRecordMetadata("TestMetadata",DataRecordMetadata.DELIMITED_RECORD); metadata.addField(new DataFieldMetadata("Name",DataFieldMetadata.STRING_FIELD, null)); metadata.addField(new DataFieldMetadata("Age",DataFieldMetadata.NUMERIC_FIELD, null)); metadata.addField(new DataFieldMetadata("City",DataFieldMetadata.STRING_FIELD, "|")); metadata.addField(new DataFieldMetadata("Born",DataFieldMetadata.DATE_FIELD, null)); metadata.addField(new DataFieldMetadata("Value",DataFieldMetadata.INTEGER_FIELD, null)); metadata.setFieldDelimiter(";"); metadata.setRecordDelimiters("\n");
In above metadata fields Name, Age and Born have ; as delimiter, field City - |, and last field (Value) is delimited from the next field (record) by \n.
The easier way is to create a XML description of record format which can be read by CloverETL and materialized in memory automatically.
This “XML serialization” of metadata can be stored either in file on disk or in String variable in memory.
It is customary to use ”.fmt” extension for XML file containing XML serialized metadata of data record.
Following example shows simple XML serialized metadata describing record containing three data fields:
<?xml version="1.0" encoding="UTF-8"?> <Record name="TestInput" type="delimited"> <Field name="Name" type="string" delimiter=";"/> <Field name="Age" type="number" delimiter="|"/> <Field name="City" type="string" delimiter="\n"/> </Record>
The same as above for xml metadata definition:
<?xml version="1.0" encoding="UTF-8"?> <Record name="TestInput" type="delimited" fieldDelimiter=";" recordDelimiter="\n"> <Field name="Name" type="string"/> <Field name="Age" type="number" delimiter="|"/> <Field name="City" type="string"/> </Record>
Example above shows definition of data record named “TestInput” specified as delimited - this is some additional info used by CloverETL components.
The record has three fields:
The XML representation also specifies encoding of characters - UTF-8 - it is imperative that when creating, you really save the file using the encoding specified. Otherwise XML parser used by CloverETL won't be able correctly interpret the file.
When needed, metadata serialized in XML form can be easily read in and appropriate Java object representing metadata can be created automatically by CloverETL. Example deserializing metadata from XML
DataRecordMetadata metadata; DataRecordMetadataXMLReaderWriter reader=new DataRecordMetadataXMLReaderWriter(); try{ metadata=reader.read(new FileInputStream("metadata.fmt")); }catch(IOException ex){ System.err.println("Error when reading metadata!!"); }
There is a strict rule for naming Data Fields (and Data Records).
This rule specifies that name can be composed of letters [a-zA-Z], numbers [0-9] (not at the first place) and [_] (underscore).
Example of valid/invalid names:
Each field in above given example has specified a delimiter (consists of one or more characters) which separates such filed from following one in text data file. This information is used by data parser when parsing data records (of this structure) from external text files. The same delimiters are used on the other hand when CloverETL outputs internal data records (of this structure) into output text files.
Delimiters can be of any length (actually up to 32 chars) and each field can have different one. Following control characters are supported:
Universal Data Reader can handle more (alternative) delimiters for one data field; in such case delimiters have to be delimited by \\|. In following example:
<Record fieldDelimiter=":\\|;" name="recordName1" previewAttachmentCharset="ISO-8859-1" recordDelimiter="\n\\|\r\n" type="delimited"> <Field name="field1" type="string"/> <Field delimiter=":\\|;\\||" name="field2" type="integer"/> . . . <Field name="field3" type="number"/> </Record>records are delimited by Linux EOL (\n) or Windows EOL (\r\n) and fields are delimited by colon (:) or semicolon (;), field2 can be delimited by pipe (|) also.
Three types of delimiters:
fieldDelimiter - character(s) separating adjacent fields of each record (during parsing or formatting). This delimiter does not follow the last field of record.recordDelimiter - character(s) marking the end of the whole record. It follows the last field of each record (during parsing or formatting).delimiter - character(s) following specific field of each record. It can be seen as the attribute of the field in the source code. It replaces the original fieldDelimiter (if defined). If it follows the last field of record, it precedes the recordDelimiter (if defined).
Every field can have additional parameters specified.
These are:
delimiter - character(s) marking end of field's value in text representation of field (during parsing or formatting). It replaces the default fieldDelimiter (if any is defined) except the recordDelimiter. If it is defined for the last field, it follows the last field and precedes the recordDelimiter (if defined).size - max length (in characters) of field's text representation when parsing from text or formatting into textshift - offset (in characters) of field's beginning from previous field (can be positive or negative) nullable - yes/no - can contain NULL valueformat - a format string specifying in which form data is expected to appear in text form - see format paragraph.locale - a locale which should be used when parsing/formatting field's value from/to text representationauto_filling - defines name of function applied for automatic field filling by some readersdefault - what is field's default valuetrim - indicates if to trim input string before converting to value or not. Implicitly input strings for numeric and date fields are trimmed for other types aren't. skipSourceRows - number of records skipped from the beginning of each single file when reading multiple files (wild cards - $ or * - are used in File URL). In such a case, multiple files satisfying the name pattern are read.Following example shows additional parameters “in action”
<?xml version="1.0" encoding="UTF-8"?> <!--Automatically generated from database null --> <Record name="EMPLOYEE" type="delimited"> <Field name="EMP_NO" type="integer" delimiter="," format="#"/> <Field name="FIRST_NAME" type="string" delimiter="," /> <Field name="LAST_NAME" type="string" delimiter="," /> <Field name="PHONE_EXT" type="string" nullable="yes" delimiter=","/> <Field name="HIRE_DATE" type="date" delimiter="," format="dd/MM/yyyy" /> <Field name="BIRTH_DATE" type="date" delimiter="," locale="en"/> <Field name="DEPT_NO" type="string" delimiter="," /> <Field name="JOB_CODE" type="string" delimiter="," /> <Field name="JOB_GRADE" type="numeric" delimiter="," format="#" /> <Field name="JOB_COUNTRY" type="string" delimiter="," trim="true"/> <Field name="SALARY" type="numeric" delimiter="," /> <Field name="FULL_NAME" type="string" nullable="yes" delimiter="\n" /> </Record>
Delimiter is used when data records are parsed from text representation of data (e.g. CSV). Each field can have its own delimiter consisting of one or more characters. Special characters (e.g. CR, LF,..) has their own text representations corresponding those from C or Java languages.
Fields can also have variant delimiters which allows for defining that field's end in data can be marked by one of defined set of characters. Variant delimiters are defined using \\| (double backslash & pipe) characters.
See following example which illustrates usage:
<Record name="EMPLOYEE" type="delimited"> <Field name="EMP_NO" type="integer" delimiter="$$" /> <Field name="FIRST_NAME" type="string" delimiter="$$\\|##" /> <Field name="LAST_NAME" type="string" delimiter="abc" /> </Record>
Above example shows EMP_NO field with $$ as a field delimiter, FIRST_NAME which can be delimited either by $$ or ## and LAST_NAME delimited by abc string.
Note: delimiters are consider ONLY when data are parsed from text representation or formatted into text representation and are used inside Record which has type defined as delimited or mixed.
Size defines expected length (in characters) of field's text representation. It is used when parsing fix-length data is needed. It is also used when field's value is formatted into text representation when writing data out. If field's value is shorter than defined size, space character is used as padding (FixLenDataWriter component allows this to be user-defined). If text representation of field is longer then defined size, then text is truncated.
Example:
<Record name="EMPLOYEE" type="fixed"> <Field name="EMP_NO" type="integer" size="5" /> <Field name="FIRST_NAME" type="string" size="15"/> <Field name="LAST_NAME" type="string" size="30" /> </Record>
Shift can be used to instruct parsing component that text representation of particular field does not start immediately after preceding field (or preceding record for 1st field) but at some offset (which can be both positive or negative). It allows, for example, to skip data which we are not interested in parsing or parse certain section of input data twice each time as different data type (negative shift can be used for this).
As you can see, some fields (PHONE_EXT for example) have attribute nullable set to “yes”. It basically means that for this field, it is allowed to contain null value. The default is yes/true (field can contain null) ! The exact behaviour is influenced by concrete data parser or data formatter, but simply put, when field is not specified to be nullable and application tries to put null value in it, this operation fails (which can result in stopping the whole transformation process).
Format attribute can be used for specifying expected format of data when parsing in or printing out of CloverETL. In this case, HIRE_DATE field is of type date and is specified, that date values in external textual data will look like this: 19/12/1999.
For all possible format specifiers (control characters), see documentation for “java.text.SimpleDateFormat”.
Similar to HIRE_DATE is JOB_GRADE field, which is of type numeric. Here the format specifies, that data is expected to be integer numbers only (no decimal point allowed).
Format is used for mapping types between CloverETL and database also. Three database types: date, time and timestamp corresponds with one Clover DateDataField type. Manner of loading data to/from database is chosen depending on the format string: if it contains date and time symbols, value is treated as timestamp, if it contains only date symbols, it is treated as date, if it contains only time symbols, it is treated as time, if format is not specified, it is treated as timestamp to avoid any information losing. When loading ByteDataField to/from database you can choose between two modes: binary (default) and blob. To use blob interface write blob in format (any other string or nothing will be treated as binary mode).
See following tables for date and number format specifiers (courtesy of SUN Corporation):
Date
| Letter | Date or Time Component | Presentation | Examples |
|---|---|---|---|
| “G” | Era designator | Text | “AD” |
| “y” | Year | Year | “1996”; “96” |
| “M” | Month in year | Month | “July”; “Jul”; “07” |
| “w” | Week in year | Number | “27” |
| “W” | Week in month | Number | “2” |
| “D” | Day in year | Number | “189” |
| “d” | Day in month | Number | “10” |
| “F” | Day of week in month | Number | “2” |
| “E” | Day in week | Text | “Tuesday”; “Tue” |
| “a” | Am/pm marker | Text | “PM” |
| “H” | Hour in day (0-23) | Number | “0” |
| “k” | Hour in day (1-24) | Number | “24” |
| “K” | Hour in am/pm (0-11) | Number | “0” |
| “h” | Hour in am/pm (1-12) | Number | “12” |
| “m” | Minute in hour | Number | “30” |
| “s” | Second in minute | Number | “55” |
| “S” | Millisecond | Number | “978” |
| “z” | Time zone | General time zone | “Pacific Standard Time”; “PST”; “GMT-08:00” |
| “Z” | Time zone | RFC 822 time zone | ”-0800” |
Examples of data formats and results
| Date and Time Pattern | Result |
|---|---|
| “yyyy.MM.dd G 'at' HH:mm:ss z” | “2001.07.04 AD at 12:08:56 PDT” |
“EEE, MMM d, yy” | “Wed, Jul 4, '01” |
| “h:mm a” | “12:08 PM” |
| “hh 'oclock' a, zzzz” | “12 o'clock PM, Pacific Daylight Time” |
| “K:mm a, z” | “0:08 PM, PDT” |
| “yyyyy.MMMMM.dd GGG hh:mm aaa” | “02001.July.04 AD 12:08 PM” |
| “EEE, d MMM yyyy HH:mm:ss Z” | “Wed, 4 Jul 2001 12:08:56 -0700” |
| “yyMMddHHmmssZ” | “010704120856-0700” |
Number
Number formatting specifiers
| Symbol | Location | Localized? | Meaning |
|---|---|---|---|
| “0” | Number | Yes | Digit |
| ”#” | Number | Yes | Digit, zero shows as absent |
| ”.” | Number | Yes | Decimal separator or monetary decimal separator |
| ”-” | Number | Yes | Minus sign |
| ”,” | Number | Yes | Grouping separator |
| “E” | Number | Yes | Separates mantissa and exponent in scientific notation. Need not be quoted in prefix or suffix. |
| ”;” | Subpattern boundary | Yes | Separates positive and negative subpatterns |
| ”%” | Prefix or suffix | Yes | Multiply by 100 and show as percentage |
| “\u2030” | Prefix or suffix | Yes | Multiply by 1000 and show as per mille |
| “�” (“\u00A4”) | Prefix or suffix | No | Currency sign, replaced by currency symbol. If doubled, replaced by international currency symbol. If present in a pattern, the monetary decimal separator is used instead of the decimal separator. |
| ”'” | Prefix or suffix | No | Used to quote special characters in a prefix or suffix, for example, ”'#'#” formats 123 to ””#123””. To create a single quote itself, use two in a row: ”# o''clock” |
When specifying format for numbers, Clover(Java) uses default system locale setting (unless other locale is specified through locale option).
This is important in cases when you are parsing data where decimal numbers use ””,”” (comma) as decimal separator whereas system default (national) says it is ”.” (point).
In such case, use locale option together with format option to change expected decimal delimiter. Example of using local & format parameters
<Field name="Freight" type="numeric" delimiter="|" format="#.#" locale="en.US"/>
Instead of specifying format parameter (or together with format), you may specify a locale parameter – it states which geographical, political, or cultural region you want your information to be/is formatted for. Thus instead of specifying format for date field, specify Germany locale (eg. locale=“de“), for instance. Clover will automatically choose proper date format used in Germany. There are cases when both format and locale parameters have their sense – for example when specifying format of decimal numbers. You define format/pattern with decimal separator and locale specifies, whether the separator is a comma or dot.
Clover field marked as auto filling will be filled from a function that is specified in value of auto_filling attribute via this metadata attribute (ie: auto_filling=“global_row_count”). If a field has an auto_filling attribute defined, input parser skips the field and continues to next fields. All these fields are ignored by parsers.
Not all input components respect this attribute. The list of supported readers are following:
DelimitedDataReaderFixlenDataReaderUniversalDataReaderDBFDataReaderCloverDataReaderXLSDataReaderXMLExtractXMLXPathReaderMultiLevelReaderDBInputTableJMSReader
Following two components can use ErrCode and ErrText auto-filling functions:
DBOutputTableDBExecute
And available functions:
default_value - if a reader does'nt read any value, the reader sets default value to input recordsource_name - name of input sourceglobal_row_count - start value for all count function is 0, this function counts every record for all sources. If the first input source has 100 records and the next has 50 records, the field will have values from 0 to 149. It couns rows for one reader.source_row_count - the function counts every record for particular source separately. From the example above, the field will have values from 0 to 99 and then the counter will be started from 0 to 49. It couns rows for one reader.metadata_row_count - the function counts every record for all sources but contrary to global_row_count, the value is counted for every metadata (data type) separately (there is supposed that readers can produce more than one type of data - for instance XML Extract - for each data type has separated counter). It couns rows for one reader.metadata_source_row_count - the function counts every record for particular source and metadata (data type) separately. It couns rows for one reader.source_timestamp - the function gets time stamp from a file. The time stamp is possible to get only from local files, not from compressed files, ftp or http connections. source_size - the function gets file size from a file. The file size is possible to get only from local files, not from compressed files, ftp or http connections. row_timestamp - the function creates time stamp for every row. reader_timestamp - the function creates time stamp for a reader component. sheet_name - name of the sheet of the input file.ERR_CODE, ERR_TEXT - these functions are available only for some components, writes error code and error text if a record fails.
Following example shows usual definition of functions in metadata
<Field delimiter=";" name="SourceName" type="string" auto_filling="source_name"/> <Field delimiter=";" name="SourceTimestamp" type="date" auto_filling="source_timestamp"/> <Field delimiter=";" name="GlobalRowCount" type="integer" auto_filling="global_row_count"/> <Field delimiter=";" name="SourceRowCount" type="integer" auto_filling="source_row_count"/> <Field delimiter=";" name="MetadataRowCount" type="integer" auto_filling="metadata_row_count"/> <Field delimiter=";" name="MetadataSourceRowCount" type="integer" auto_filling="metadata_source_row_count"/>
CloverETL offers for each field default value to be specified. This value is used (in certain cases) when field is assigned to be NULL, but null value is not allowed for such field. It contrasts a little bit with what was stated before, but only on a first sight.
When field is assigned a default value, it is fist parsed from specified textual representation. If the text is not parseable - like if you specify default=“xyz” for numeric field (e.g. integer) - the assignment of default finishes with parse exception.
Also if you specify default value for date field, make sure it matches specified (or system default) date format - it holds in generally for all data types.
Following example shows fields with specified default values
<?xml version="1.0" encoding="UTF-8"?> <Record name="Orders" type="delimited"> <Field name="OrderID" type="numeric" delimiter="|" format="#" /> <Field name="OrderDate" type="date" delimiter="|" format="dd.MM.yyyy" default="01.01.1900" nullable="no"/> <Field name="Amount" type="number" delimiter="\n" default="0.0" nullable="no" /> </Record>
In this example, “OrderDate” is defaulted to 1.1.1900 in case it is not present in text data which this record is parsed from. In general, when this field is assigned null value, this specified default value is assigned instead. The same holds for “Amount” field, except the default is specified to be 0.
Said that, there is one more important note: this behaviour is not default and concerns only data parsers. If you, in your code, attempt to assign null value into not-nullable field, the “BadDataFormatException” will be raised.
If you use any of clover's data parsers, you may specify “DataPolicy” which states what should happen if parsed value can't be assigned to data field (as in case when value is null and field is not-nullable).
There are three different data policies defined:
Default value is also used when (in Java code) you call “setToDefault()” method on either DataRecord or DataField. In such case, Clover attempts to set/parse field's value from specified default.
Also when “reset()” method is called on DataRecord or DataField and field has defined default value, then the field is set to that default value.
Consult documentation regarding parser components for additional details !
In addition to all other metadata created or extracted using GUI, you can also write metadata definition in the Source tab of the Graph Editor pane. Unlike the metadata defined in GUI, such metadata written in the Source tab cannot be edited in GUI. To define the metadata in the Source tab, open this tab and write there the following:
<Metadata id=“YourMetadataId” connection=“YourConnectionToDB” sqlQuery=“YourQuery”/>
Select any expression for YourMetadataId, type your DB connection that should be used to connect to DB as YourConnectionToDB and type the query that will be used to extract data from DB as YourQuery.
If you want to speed the run of your graph, you can also add to your query “where 1=0” or “and where 1=0” (the last expression should be added to the query terminated by other “where …” expression. This way only metadata will be extracted and no data will be read.
Remember that such metadata cannot be created in GUI and will only be generated at the runtime.
DTD of DataRecord metadata XML representation. Record metadata DTD:
<!ELEMENT Record (FIELD+)> <!ATTLIST Record name ID #REQUIRED id ID #REQUIRED name CDATA #REQUIRED type NMTOKEN (delimited | fixed | mixed) #REQUIRED locale CDATA #IMPLIED recordDelimiter CDATA #IMPLIED <!ELEMENT Field (#PCDATA) EMPTY> <!ATTLIST Field name ID #REQUIRED type NMTOKEN #REQUIRED delimiter NMTOKEN #IMPLIED "," size NMTOKEN #IMPLIED "0" format CDATA #IMPLIED locale CDATA #IMPLIED nullable NMTOKEN (true | false) #IMPLIED "true" auto_filling NMTOKEN #IMPLIED default CDATA #IMPLIED