Lookup tables

Lookup tables contain data records that are searched.

There are five variants of lookup tables currently supported:

  • SimpleLookup - complete lookup table stored in memory, data is read from external text file.
  • DBLookup - lookup table base on parametrized SQL query. Results can be cached in memory to limit access to database.
  • RangeLookup - elements of lookup are “intervals”: each interval has start and end point(s).
  • PersistentLookup - complete lookup table stored in external file. It uses jdbm.
  • AspellLookup - an Aspell-like lookup table that enables you to look up data records with similar lookup key.

Since 2.6 version all lookup tables can be internal - all attributes and theirs values are stored directly in graph, or external (shared) - attributes and theirs values are stored in separate file and in graph there is only link to the file eg.:

in graph:

 <LookupTable id="LookupTable0" lookupConfig="${LOOKUP_DIR}/MyRangeLookup.cfg"/>

in MyRangeLookup.cfg file:

endFields=dateEnd;timeEnd;nameEnd
endInclude=true;true;true
metadata=${META_DIR}/delimited/schedule.fmt
name=Tour schedule
startFields=dateStart;timeStart;nameStart
startInclude=true;true;true
type=rangeLookup
useI18N=true

SimpleLookup

All data records stored in the simple lookup table are kept in memory. Therefore, sufficient memory has to be available in order to accommodate all data records to be stored. If the data records are loaded from a data file, the size of available memory should be approximately 6× bigger than the size of the data file. However, this multiplier may vary for different types of data records stored in the data file.


Xml attributes:

Attribute Mandatory Description Default Version
id yes component identification
type yes component type simpleLookup
metadata yes metadata associated with the DataRecord stored in this lookup table. For internal table this is id of requested metadata, for external table this is file with metadata definition
fileURL yes path to the data file.
key yes names of fields which comprise indexKey to lookup table. Specifies what object will be used for looking up data.
dataType yes type of file data - delimited or fixed till 2.4
charset no character encoding of the lookup data ISO-8859-1
initialSize no initial size of lookup table “Lookup.LOOKUP_INITIAL_CAPACITY”
from properties or 512
keyDuplicates no indicates if there can be more records stored with one key false since 2.6


Xml definition:
Transformation for ver. 2.5:

  <Global>
    ...
    <LookupTable id="LookupTable0" type="simpleLookup" metadata="Metadata0" 
                 fileURL="data/delimited/employees.dat" dataType="delimited" charset="ISO-8859-1" 
                 key="EmployeeID" name="name" 
    />
  </Global>
 
  <Node id="JOIN" type="REFORMAT">
    <attr name="transform">
        import org.jetel.component.DataRecordTransform;
        import org.jetel.data.DataRecord;
        import org.jetel.data.RecordKey;
        import org.jetel.data.lookup.LookupTable;
        import org.jetel.exception.*;
        import org.jetel.graph.TransformationGraph;
 
        public class reformatTest extends DataRecordTransform {
	    LookupTable mylookup;
 
	    public boolean init(){
		try {
			mylookup = graph.getLookupTable("LookupTable0");
			if (!mylookup.isInitialized()) {
				mylookup.init();
			}
		} catch (ComponentNotReadyException e) {
			e.printStackTrace();
                        return false;
		}
 
	        // create key 
	        String[] lookupKeyStr={"EmployeeID"};
		RecordKey key=new RecordKey(lookupKeyStr,sourceMetadata[0]);
		key.init();
		mylookup.setLookupKey(key);
		return true;
	    }
 
	    public boolean transform(DataRecord[] source, DataRecord[] target){
		DataRecord employee; 
		employee=mylookup.get(source[0]);
 
		if (employee==null) return false; // skip this one
 
		target[0].getField(0).setValue(source[0].getField(0).getValue());
  		target[0].getField(1).setValue(source[0].getField(1).getValue());
		target[0].getField(2).setValue(source[0].getField(2).getValue().toString());
		target[0].getField(3).setValue(employee.getField(0).getValue().toString());
		target[0].getField(4).setValue(employee.getField(1).getValue());
 
 
		return true;
	    }
 
	    public void finished(){
		mylookup.free();
	    }
         }
    </attr>
  </Node>
Transformation since ver. 2.6:
  <Global>
    ...
    <LookupTable id="LookupTable0" type="simpleLookup" metadata="Metadata0" 
                 fileURL="data/delimited/employees.dat" dataType="delimited" charset="ISO-8859-1" 
                 key="EmployeeID" name="name" 
    />
  </Global>
 
  <Node id="JOIN" type="REFORMAT">
    <attr name="transform">
 
import org.jetel.component.DataRecordTransform;
import org.jetel.data.DataRecord;
import org.jetel.data.RecordKey;
import org.jetel.data.lookup.Lookup;
import org.jetel.data.lookup.LookupTable;
import org.jetel.exception.ComponentNotReadyException;
 
public class reformatTest extends DataRecordTransform{
 
	Lookup mylookup;
 
	public boolean init() throws ComponentNotReadyException{
 
		// create key 
	    String[] lookupKeyStr={"EmployeeID"};
		RecordKey key=new RecordKey(lookupKeyStr,sourceMetadata[0]);
		key.init();
 
		LookupTable lt = graph.getLookupTable("LookupTable0");
		lt.init();
		mylookup = lt.createLookup(key);
 
		return true;
	}
 
	public int transform(DataRecord[] source, DataRecord[] target){
 
		mylookup.seek(source[0]);
 
		if (!mylookup.hasNext()) {// skip this one
			errorMessage = "Order: " + source[0].getField(0).getValue() + " - no corresponding record in lookup table";
			return SKIP; 
		}
 
		DataRecord employee = (DataRecord)mylookup.next(); 
 
		target[0].getField(0).setValue(source[0].getField(0).getValue());
  		target[0].getField(1).setValue(source[0].getField(1).getValue());
		target[0].getField(2).setValue(source[0].getField(2).getValue().toString());
		target[0].getField(3).setValue(employee.getField(0).getValue().toString());
		target[0].getField(4).setValue(employee.getField(1).getValue());
 
		return ALL;
	}
}
    </attr>
  </Node>


Lookup definition dialog:





DBLookup

Attribute Mandatory Description Default
id yes component identification
type yes component type dbLookup
metadata yes metadata associated with the DataRecord stored in this lookup table. For internal table this is id of requested metadata, for external table this is file with metadata definition
sqlQuery yes db query for getting data from database in accordance to key value.
dbConnection yes db connection to database. For internal table this is id of requested connection, for external table this is file with connection properties
maxCached no max record cached in lookup table 0
storeNulls no indicates if to remember records for which there are no corresponding records in database false


Xml definition:

  <Global>
    ...
    <LookupTable id="LookupTable0" type="dbLookup" metadata="Metadata0" 
                  dbConnection="DBConnection0" name="name" >
                  <attr name="sqlQuery">select * from employee where employee_id=?</attr>
    </LookupTable>
  </Global>
 
  <Node id="JOIN" type="REFORMAT">
    <attr name="transform">
        import org.jetel.component.DataRecordTransform;
        import org.jetel.data.DataRecord;
        import org.jetel.data.RecordKey;
        import org.jetel.data.lookup.LookupTable;
        import org.jetel.exception.*;
        import org.jetel.graph.TransformationGraph;
 
        public class reformatTest extends DataRecordTransform {
	    LookupTable mylookup;
 
	    public boolean init(){
		try {
			mylookup = graph.getLookupTable("LookupTable0");
			if (!mylookup.isInitialized()) {
				mylookup.init();
			}
		} catch (ComponentNotReadyException e) {
			e.printStackTrace();
                        return false;
		}
 
	        // create key 
	        String[] lookupKeyStr={"EmployeeID"};
		RecordKey key=new RecordKey(lookupKeyStr,sourceMetadata[0]);
		key.init();
		mylookup.setLookupKey(key);
		return true;
	    }
 
	    public boolean transform(DataRecord[] source, DataRecord[] target){
		DataRecord employee; 
		employee=mylookup.get(source[0]);
 
		if (employee==null) return false; // skip this one
 
		target[0].getField(0).setValue(source[0].getField(0).getValue());
  		target[0].getField(1).setValue(source[0].getField(1).getValue());
		target[0].getField(2).setValue(source[0].getField(2).getValue().toString());
		target[0].getField(3).setValue(employee.getField(0).getValue().toString());
		target[0].getField(4).setValue(employee.getField(1).getValue());
 
		return true;
	    }
 
	    public void finished(){
		mylookup.free();
	    }
         }
    </attr>
  </Node>

RangeLookup

Range lookup table contains records, which defines intervals. It means that they must have special structure: first field is the name of the interval, odd fields marks starts of intervals, even fields (from 2) means corresponding ends of intervals, eg: Lookup table defined as follows:

  • low_slow,0,10,0,50
  • low_fast,0,10,50,100
  • high_slow,10,20,0,50
  • high_fast,10,20,50,100

has 4 intervals with 2 searching parameters: first from interval 0-10, and second from interval 0-100. Intervals can overlap.
It is possible to use “unlimited” intervals by setting null value to start or end field, eg. interval null,0 “contains” elements smaller then 0.


Xml attributes:

Attribute Mandatory Description Default Version
id yes component identification
type yes component type simpleLookup
metadata yes metadata associated with the DataRecord stored in this lookup table. For internal table this is id of requested metadata, for external table this is file with metadata definition
fileURL yes path to the data file.
dataType yes type of file data - delimited or fixed till 2.4
charset no character encoding of the lookup data ISO-8859-1
useI18N no true/false perform comparing according to national rules - e.g. Czech or German handling of characters like “i”,”í”. false
locale no locale to be used when sorting using I18N rules. system default
startInclude no indicates if start points are included to intervals. It has to have such number of booleans as there is number of intervals true
endInclude no indicates if end points are included to intervals. It has to have such number of booleans as there is number of intervals false


Xml definition:

  <Global>
    ...
    <LookupTable type="rangeLookup" id="LookupTable0" metadata="Metadata0" charset="ISO-8859-1" dataType="delimited">
          <attr name="endInclude">true;true;true</attr>
          <attr name="startInclude">true;true;true</attr>
          <attr name="useI18N">true</attr>
          <attr name="name">Tour schedule</attr>
    </LookupTable>
  </Global>
 
  <Node id="LOOKUP_JOIN0" joinKey="date;time;name" lookupTable="LookupTable0" type="LOOKUP_JOIN">
     <attr name="transform">${out.0.name} = ${in.0.name};
                            ${out.0.date} = ${in.0.date};
                            ${out.0.time} = ${in.0.time};
                            ${out.0.activity} = ${in.1.name};
     </attr>
  </Node>

PersistentLookup

ETL Version since: 2.6
Complete lookup table stored in external file. It uses jdbm.


Xml attributes:

Attribute Mandatory Description Default
id yes component identification
type yes component type persistentLookup
fileURL yes path to the data file where lookup is stored
metadata yes metadata associated with the DataRecord stored in this lookup table. For internal table this is id of requested metadata, for external table this is file with metadata definition
key yes names of fields which comprise indexKey to lookup table. Specifies what object will be used for looking up data.
replace no option to replace existing entry (persistent lookup cannot store duplicate entries) true
commitInterval no commit interval in number of records (too high interval can invoke java.nio.BufferUnderflowException) 100
disableTransactions no option to disable transaction (to increase performance at the cost of potential data loss). false
pageSize no number of entries per page (must be even) 16
cacheSize no given maximum number of objects in cache 1000


Xml definition:

<Global>
  ...
   <LookupTable fileURL="${PROJECT}/data-out/persistentLookup" id="LookupTable1" key="EmployeeID" metadata="Metadata0" name="lookupName" type="persistentLookup"/>
</Global>
 
<Node enabled="enabled" id="JOIN" type="REFORMAT">
<attr name="transform"><![CDATA[
import org.jetel.component.DataRecordTransform;
import org.jetel.data.DataRecord;
import org.jetel.data.RecordKey;
import org.jetel.data.lookup.LookupTable;
import org.jetel.exception.*;
import org.jetel.graph.TransformationGraph;
 
public class reformatTest extends DataRecordTransform{
 
	LookupTable mylookup;
 
	public boolean init(){
 
		try {
			mylookup = graph.getLookupTable("LookupTable1");
			if (!mylookup.isInitialized()) {
				mylookup.init();
			}
		} catch (ComponentNotReadyException e) {
			e.printStackTrace();
                        return false;
		}
 
		// create key 
  	        String[] lookupKeyStr={"EmployeeID"};
		RecordKey key=new RecordKey(lookupKeyStr,sourceMetadata[0]);
		key.init();
		mylookup.setLookupKey(key);
		return true;
	}
 
	public int transform(DataRecord[] source, DataRecord[] target){
 
		DataRecord employee; 
		employee=mylookup.get(source[0]);
 
		if (employee==null) {// skip this one
			errorMessage = "Input:\n" + source[0] + " - There is no record with corresponding key in lookup table\n";
			return SKIP; 
		}
 
		target[0].getField(0).setValue(source[0].getField(0).getValue());
  		target[0].getField(1).setValue(source[0].getField(1).getValue());
		target[0].getField(2).setValue(source[0].getField(2).getValue().toString());
		target[0].getField(3).setValue(employee.getField(0).getValue().toString());
		target[0].getField(4).setValue(employee.getField(1).getValue());
 
		return ALL;
	}
 
	public void finished(){
		mylookup.free();
	}
}
]]></attr>
</Node>

AspellLookup

COMMERCIAL, since version 2.6

An Aspell-like lookup table that enables you to look up data records with similar lookup key. The similarity of two words (keys) is measured in terms of edit operations, their cost and sum of these costs (see below). Such approach may be useful e. g. when looking for a street the name of which is misspelled to a certain extent.

All data records stored in the Aspell lookup table are kept in memory. Therefore, sufficient memory has to be available in order to accommodate all data records to be stored. If the data records are loaded from a data file, the size of available memory should be approximately 7× bigger than the size of the data file. However, this multiplier may vary for different types of data records stored in the data file.


Similarity of Words:

The edit operations and their default costs are as follows:

Operation Description Cost
CASE change the case of a single character 10
TRANSPOSE transpose two adjacent characters 90
DELETE delete a single character 95
INSERT insert a single character 95
REPLACE replace a single character 100

The edit distance is defined as the minimum sum of costs of edit operations required to transform a misspelled word into a correctly spelled one. The edit distance between two words is actually the similarity of these two words.

Let's assume that the Aspell lookup table contains only the word “Hello”. Then the edit distance of the following queries would be as follows:

Query Edit Operations Edit Distance
HELLO CASE (E) + CASE (L) + CASE (L) + CASE (O) 40
helo CASE (h) + INSERT (l) 105
Halllo REPLACE (a → e) + REMOVE (l) 195
haLol CASE (h) + REPLACE (a → e) + CASE (L) + TRANSPOSE (ol) 210


XML Attributes:

Attribute Mandatory Short Description Default ETL Version
id yes The ID of the lookup table. N/A
type yes The type of the lookup table (has to be aspellLookup). N/A
name no The name of the lookup table. null
metadata yes The ID of the meta data associated with the lookup table. N/A
lookupKeyField yes The name of a string data field used as a key for lookup. N/A
editDistanceField no The name of a numeric data field used to store edit distance. null since 2.8.1
spellingThreshold no The maximum allowed edit distance from a correctly spelled word. 230
editCosts no User-defined costs of edit operations. null since 2.9.0
removeDiacriticalMarks no Should the diacritical marks be removed from lookup keys? false
includeBestGuesses no Should the “best guesses” be included in the result? false since 2.8.0
dataFileUrl no The URL of the data file containing the lookup table data. null
dataFileCharset no The character set of the data file. ISO-8859-1

The editDistanceField attribute may be set if the edit distance should be stored within the looked up data records. In such a case, performance of the lookup table may be reduced because data records have to be duplicated in order to store the edit distance. The editDistanceField has to be present within metadata referenced by the metadata attribute. It is advised to enable autofilling (with default_value) on this field in order to preserve data layout of data records stored in the lookup table data file.

The spellingThreshold attribute defines the maximum allowed edit distance from a correctly spelled word. The higher the threshold, the more mistakes can be present in incorrectly spelled words and vice versa. Words with the edit distance above the spellingThreshold are NOT contained in the query result, unless the includeBestGuesses attribute is set to true.

The editCosts attribute may be used to override the default costs of individual edit operations. User-defined edit costs are specified in the form <edit_operation_1> = <edit_cost_1>; …; <edit_operation_n> = <edit_cost_n> where <edit_operation_x> is the name of an edit operation and <edit_cost_x> is its non-negative cost.

The removeDiacriticalMarks attribute specifies whether or not the diacritical marks should be removed from the lookup key before any put/get operation. Setting this property to true should solve any potential problems you might encounter when using words with diacritical marks. However, this may lead to a problem because similar (but different) words might be indexed using a same key. This should not be an issue when similarity searching is performed.

The includeBestGuesses attribute specifies whether or not the “best guesses” (the closest matches when no suggestions were found) should be included in the result. Setting this flag to true results in returning more suggestions if the spelling threshold is too strict.


XML Definition:

<Global>
    ...
    <LookupTable dataFileCharset="UTF-8" dataFileUrl="${PROJECT}/lookup/streets.dat" id="LookupTable0"
            lookupKeyField="name" metadata="Metadata0" name="streets" type="aspellLookup"/>
</Global>
 
<Node charset="UTF-8" id="LOOKUP_JOIN0" joinKey="street" lookupTable="LookupTable0" type="LOOKUP_JOIN">
    <attr name="transform"><![CDATA[//#TL
 
        // Transforms input record into output record.
        function transform() {
            $0.street := $0.street;
            $0.name := $1.name;
            $0.city := $1.city;
        }
 
    ]]></attr>
</Node>

graph_elements/lookups.txt · Last modified: 2010/04/14 18:43 by mjanik
Back to top
chimeric.de = chi`s home Valid CSS Driven by DokuWiki do yourself a favour and use a real browser - get firefox!! Recent changes RSS feed Valid XHTML 1.0