Kettle (now known as Pentaho Data Integration)

I've recently begun doing some development in Kettle (actually GeoKettle, the geospatially enabled version of Kettle), version 3.2. I've been writing some geocoding components that take an address and pass back latitude/longitude, using ESRI software. I wrote these for the benefit of my company and the community, and because it was fun.

Components I have written:

  • ESRI Geocoder Plugin - geocode addresses
  • Create a Point Geometry - create a new Geometry field that contains a point, using two other fields in your data. (This is possible to do another way, but this one is faster because it uses native conversion rather than WKT.)
  • Spatial Universe Match - Perform a spatial match on a set of targets against a universe of geometries. Supports Within, Contains, Intersects, Equals, Crosses, Overlaps, Touches, and Disjoint
  • Shapefile Output - output to a shapefile. (Based on the one that came with GeoKettle 3.2, which had some serious flaws.)

Eventually you will be able to download and use these, when I get around to writing up documentation on them and packaging them up.

Tips

Development

Miscellaneous

Here are some things I have learned as I have gone about writing new components. I'll expand this section as I go.

  • When it comes to data types, if you specify the type of data in your metadata as ValueMetaInterface.TYPE_INTEGER, the object you put in the row had better actually be a Java Long. Take a look at the source code for Kettle, specifically src-core/org/pentaho/di/core/row/ValueMeta.java. You'll see that a TYPE_INTEGER value is cast to a Long. Anything other than a Long in your object will throw the unhelpful error message “There was a data type error: the data type of <your java class> object [the value] does not correspond to value meta [whatever you defined as the Kettle type]”
    • These are the Kettle types and the corresponding Java classes. All your data should fall into one of these buckets:
      Kettle type          Java class
      TYPE_STRING          String
      TYPE_DATE            java.util.Date
      TYPE_NUMBER          Double
      TYPE_INTEGER         Long
      TYPE_BIGNUMBER       java.math.BigDecimal
      TYPE_BOOLEAN         Boolean
      TYPE_BINARY          byte[]
      TYPE_SERIALIZABLE    uses object.toString()
  • When you are setting up your metadata, here is a handy function that does most of what you need, along with an example. These go in your “Meta” class (i.e. the one that extends BaseStepMeta and implements StepMetaInterface):
    • 	private void addOutputField(RowMetaInterface r, String origin, String fieldName, int fieldType, int length, int precision, Object defaultValue, String conversionMask) {
      		ValueMetaAndData newField;
      		ValueMetaInterface v;
      		// Describe the new field: name, Kettle type, and a default value.
      		newField = new ValueMetaAndData(new ValueMeta(fieldName, fieldType), defaultValue);
      		newField.getValueMeta().setLength(length);
      		newField.getValueMeta().setPrecision(precision);
      		// The mask is optional; pass null to leave it unset.
      		if (conversionMask != null) {
      			newField.getValueMeta().setConversionMask(conversionMask);
      		}
      		
      		// Record which step produced the field, then append it to the row metadata.
      		v = newField.getValueMeta();
      		v.setOrigin(origin);
      		r.addValueMeta(v);
      	}
      
      	public void getFields(RowMetaInterface r, String origin, RowMetaInterface[] info, StepMeta nextStep, VariableSpace space)
      	{
      		// TYPE_NUMBER maps to Double, so the default value is a Double.
      		addOutputField(r, origin, "latitude", ValueMetaInterface.TYPE_NUMBER, 18, 15, Double.valueOf(0), "###.###############");
      	}
      
  • The metadata values for length and precision don't have to be specified (i.e. you can omit the calls to setLength() and setPrecision() in the above example). They are useful for strings: when outputting values, the string will be padded to the length specified. For numbers, you can leave them out if you want, although I don't recommend it, because these values tell the end user of your component what to expect. The most important piece of metadata for numbers is the format string. Supplying a default format string will help the end user out a lot, I think. The user can change your format string, but if they don't and you haven't supplied a default, high-precision numbers will be output with only 2 decimal places.
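
To see why the format string matters so much for numbers, here is a small sketch in plain Java with no Kettle classes involved. It assumes that number masks like the one in the latitude example follow java.text.DecimalFormat pattern syntax. The class name MaskDemo, the helper format() and the sample value are made up for illustration, and the mask "#.##" is only my stand-in for a two-decimal default; it is not taken from the Kettle source.

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.util.Locale;

public class MaskDemo {
    // Hypothetical helper: format a value with a DecimalFormat-style mask.
    // The US symbols are fixed so the decimal separator is always '.'.
    static String format(double value, String mask) {
        DecimalFormat df = new DecimalFormat(mask, DecimalFormatSymbols.getInstance(Locale.US));
        return df.format(value);
    }

    public static void main(String[] args) {
        double latitude = 45.123456789; // made-up sample value

        // Assumed stand-in for a two-decimal default mask.
        System.out.println(format(latitude, "#.##"));               // 45.12

        // The mask used for the latitude field in the example above.
        System.out.println(format(latitude, "###.###############")); // 45.123456789
    }
}
```

With the short mask, most of a coordinate's precision is lost on output, which is why I think shipping a sensible default mask with your component is worth the extra line.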
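
Going back to the type table above, the usual way to get bitten is to put an Integer or a Float into a row whose metadata says TYPE_INTEGER or TYPE_NUMBER. Below is a minimal sketch in plain Java (no Kettle classes) of normalizing values before they go into the row. The class KettleTypeSketch and the helpers toKettleInteger() and toKettleNumber() are hypothetical names of my own, not part of Kettle.

```java
public class KettleTypeSketch {
    // Hypothetical helper: widen any numeric value to Long,
    // the class the table lists for TYPE_INTEGER.
    static Long toKettleInteger(Number n) {
        return n == null ? null : Long.valueOf(n.longValue());
    }

    // Hypothetical helper: convert any numeric value to Double,
    // the class the table lists for TYPE_NUMBER.
    static Double toKettleNumber(Number n) {
        return n == null ? null : Double.valueOf(n.doubleValue());
    }

    public static void main(String[] args) {
        Object wrong = 42;                  // autoboxes to Integer, not Long
        Object right = toKettleInteger(42); // a real Long

        System.out.println(wrong instanceof Long); // false
        System.out.println(right instanceof Long); // true
    }
}
```

Running every numeric value through a helper like this just before it is stored in the output row is one way to keep the object type and the declared Kettle type in agreement.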