Quantcast
Channel: SsisBatchGeocoder Wiki Rss Feed
Viewing all articles
Browse latest Browse all 17

Updated Wiki: Documentation

$
0
0

Overview

The SSIS Batch Geocoder is a SQL Server Integration Services data flow transformation that geocodes address information within the SSIS pipeline using the geocoding capabilities of the Bing Maps Geocode Dataflow. The transformation accepts search key data on its input and presents location data on its output in the form of rooftop (if available) and interpolated latitude and longitude. All additional input columns are passed through to the output. One such column, the Entity ID column, must contain a user-supplied unique identifier for each address.

In addition to the location data, the geocoding process produces validated and normalized address information. This data may optionally be included in the output along with the locations. The search key columns (e.g. street address, city, zip code, full address) may optionally be passed through to the output as well.

Address information within the pipeline is buffered and then uploaded to Bing Maps in batches. Once Bing Maps geocodes the addresses, the data is downloaded and reintegrated back into the SSIS pipeline using the supplied unique Entity ID values. During reintegration, location data and, optionally, the validated address information joins back with the original input data and fed to the next pipeline component.

The input address information columns can include:
  • Culture - the cuture code, such as en-US.
  • Query - a complete address query contained in a single field.
  • AddressLine - the requested address line (e.g. number and street).
  • Locality - the requested populated place, such as city.
  • PostalTown - the requested postal city of an address
  • AdminDistrict - the requested subdivision within a country or region, such as state
  • PostalCode - the requested postal code, such as zip code
  • District - a higher level administrative subdivision used in some countries or regions
  • CountryRegion - the requested country or region

The input columns containing these fields can have any names that do not conflict with the names of added output columns (listed below). Custom properties serve to map selected input columns to particular address information fields.

In the United States, a query would typically consist of the Culture field "en-US" and either the Query field or the AddressLine, Locality, AdminDistrict, and PostalCode fields. If the Query field is defined and is not empty then it overrides any other address information fields other than Culture.

The location data columns added to the output include:
  • RooftopLatitude& RooftopLongitude - exact location as double precision floating point coordinates, when available.
  • InterpolatedLatitude& InterpolatedLongitude - interpolated location as double precision floating point coordinates.
  • StatusCode - indicates geocoding success or failure.
  • Confidence - the confidence level assigned by Bing Maps, High, Medium, or Low. If the request failed, this column contains the fault reason.
  • EntityType - the resolution at which geocoding was performed (e.g. Address = found a specific address). See http://msdn.microsoft.com/en-us/library/ff728811.aspx for a list of entity types.

The validated and normalized columns that may optionally be added to the output include:
  • ValidatedFormattedAddress - the complete formatted address as a single field
  • ValidatedAddressLine - the address line (e.g. number and street)
  • ValidatedLocality - the populated place, such as city
  • ValidatedPostalTown - the postal city of an address
  • ValidatedAdminDistrict - the subdivision within a country or region, such as state
  • ValidatedPostalCode - the postal code, such as zip code
  • ValidatedDistrict - a higher level administrative subdivision used in some countries or regions
  • ValidatedCountryRegion - the country or region

The input columns that provide the address information to be geocoded are suppressed from the output by default, but may optionally be passed through.

A list of all input and output fields of the Bing Maps Geocode Dataflow can be found at http://msdn.microsoft.com/en-us/library/ff701736.aspx.

Capabilities

  • Geocodes addresses directly in the SSIS pipeline.
  • Performs geocoding in batches as opposed to row-by-row.
  • Leverages the geocoding capabilities of the Bing Maps Geocode Dataflow.
  • Outputs location data as double precision floating point coordinate pairs that can readily be converted to spatial data types.
  • Can select which input columns contain address information (AddressLine, PostalCode, etc).
  • Input address information columns can be passed through to the output or suppressed.
  • Validated address information from the geocoder can be added to the output.
  • Package variable based configuration of Bing Maps settings to allow environment specific configuration.
  • Pass-thru of additional input columns of data (non address data) in the output pipeline.
  • Data columns added to output can be Unicode or can be assigned a code page, so that further conversion is unnecessary

Prerequisites

SQL Server 2008 R2 Integration Services (has not been tested with earlier versions)
.NET Framework 3.5

For source compilation:
Visual Studio 2010 C# (for code modification and compilation)

Installation

  1. Install the SSIS Batch Geocoder component DLL on the machine that will be running SSIS (server or developer workstation) using the setup package appropriate to the platform (x64 or x86). If a previous version is installed, uninstall it first. If the machine will be used to edit SSIS packages that contain the component then continue with the following steps.
  2. Open Business Intelligence Design Studio
  3. Display the Toolbox window (it need not contain any items)
  4. Right click within the toolbox window and select "Choose Items…".
  5. Select the "SSIS Data Flow Items" tab.
  6. Choose "SSIS Batch Geocoder" from the list of components, and click OK.

The component should now be available from the list of Data Flow Transformations in the toolbox when editing a data flow within an SSIS package project. It will not be visible at other times, such as when editing a different type of project or when editing a control flow.

Usage

The SSIS Batch Geocoder works just like any other transformation component; once you add a Data Flow task to the Control flow design surface, drag and drop the "MapPoint Batch Geocoder" from the toolbox and place it somewhere between a source and destination. After connecting the transformation to upstream (input) item, configuration of the geocoder can begin.

Initially, all input columns are selected for use. You can deselect any that you do not wish to use with or pass through the component. You can also use the advanced editor to rename columns, but after doing so you will have to manually update any custom properties that were referencing the old names.

Transformation Configuration

Double click the component on the data flow design surface to open the Advanced Editor.

Two configuration options, the geocode batch job description and the Bing Maps Key, take their values from package variables. Custom properties allow you to select which variable will supply the values. If the package already contains user variables named JobDescription and BingMapsKey then these variables will be selected by default when the package is created, but you can still select other variables.

The available custom properties include:
  • Add Search Key Columns To Outputs - True/False. If true, the selected address information input columns are passed through to the output.
  • Add Validated Address Columns To Outputs - True/False. If true, the validated and normalized address information returned from the geocoder is added to the output.
  • Added Output Columns Code Page - integer value (0 - 32767) specifying the code page for string-containing output columns added by SSIS Batch Geocoder, where 0 = use Unicode. An attempt is made to "guess" the appropriate value based on available input columns, but the original setting may not always be correct and should always be checked. This property lets you simplify the data flow by eliminating the need to convert string column types downstream.
  • Job Description - the optional name of a variable containing a job description that is added to the submitted job. May be useful for accounting or debugging purposes.
  • BingMapsKey - The Bing Maps key. You can create an account and a key at https://www.bingmapsportal.com.
  • Bing Maps Minimum Confidence - The minimum confidence that can result in a successful match. Choices are High, Medium, Low, and Default. Choosing Default results in no minimum confidence being passed to the geocoder. When a Query input is used (Query column defined and column data is not null or empty), minimum confidence is not used.
  • Source AddressLine - The input column to use as the AddressLine.
  • Source Administrative District - The input column to use as the AdminDistrict (state).
  • Source Composite Query - The input column to use as the Query.
  • Source CountryRegion - The input column to use as the CountryRegion (country).
  • Source Culture - The input column to use as the Culture.
  • Source District - The input column to use as the District.
  • Source Entity ID - The input column to use as the EntityID.
  • Source Locality - The input column to use as the Locality (city).
  • Source PostalTown - The input column to use as the PostalTown.
  • Source PostalCode - The input column to use as the PostalCode (ZIP code).

If, at the time the component input is attached to an upstream component, a column exists whose name corresponds to one of the Bing Maps address information fields, such as Culture, Query, AddressLine, Locality, etc., the corresponding custom property will be set automatically. By naming the input columns appropriately, you can save some setup time.

If input columns exist with names that cause an input column name property to be set automatically, but you don't want a particular column to be used, change that input column assignment to point to a different input column name, or to <Blank>.

With care, you can remove unused input columns using the Input Columns tab of the Advanced Editor. If you do this, make sure that no Custom Property on the Custom Properties tab is still pointing to a column that has been removed, and also check the Input and Output Properties tab and remove the corresponding column from the Output Columns node (under Output) if it exists. This cleanup really should be done automatically by the component, and the issue may be addressed in a future release.

Example Files

Four example files can be found in a separate download:
  • GeocoderExample.dtsx - The example SSIS package
  • ClinicLocationData.txt - Input data for the package
  • GeocodesFromQuery.txt - Sample output from the package
  • GeocodesFromFields.txt - Sample output from the package

These files include a small random sample of address data obtained from https://www.alirts.oshpd.ca.gov/AdvSearch.aspx. It is stored as a tab-delimited text file. The sample flow creates two output files (also tab-delimited text), one based on a composite query ("Query" column) and another based on multiple individual search fields (AddressLine, Locality, PostalCode, and AdminDistrict). Both optional column set outputs (Search Key and Validated Address) are enabled to provide the maximum amount of information in the output files. A pass-through column, FacilityName is also included. This column simply passes through from input to output.

To use these files
  1. Create a new Business Intelligence Development Studio Integration Services project.
  2. Copy the above files to the project directory.
  3. Delete the default package that was created with the new project.
  4. Use "Add existing item to add GeocoderExample.dtsx and ClinicLocationData.txt to the project.
  5. Make sure that the SSIS Batch Decoder transformation component has been added to the designer toolbox, as described under "Installation," above.
  6. Open the GeocoderExample.dtsx package.
  7. Open the Clinic Locations connection manager and make sure it points to ClinicLocationData.txt. Browse to the file if necessary, and then close the connection manager.
  8. In a similar manner open the Geocodes From Query Connection and Geocodes From Fields Connection connection managers and browse to the corresponding text files if necessary.
  9. Right click the design surface and select "Variables..."
  10. Change the Value column of the BingMapsKey to be your own key. See BingMapsKey under Transformation Configuration, above.
  11. Run the package.

You may want to try deleting the sample batch geocoder component and adding your own to the sample flow. The column and variable names have been chosen to automatically be configured when you create a new instance of the component.

Logging

Log events are generated during geocoding and can be captured using standard SSIS logging. All events are reported under the OnInformation event.

Design Considerations

Taken from the original MapPoint Batch Geocoder documentation

Isolating the Geocoding Process

The possibility of connectivity failure increases anytime Internet-based services are involved, as is the case with this component’s dependency on the Bing Maps service. The ETL processes can be designed so that a geocoding failure does not cause the entire ETL process to fail, particularly if the ETL process is long and difficult/tedious to restart.

Instead of performing geocoding as an extra data flow step in an already complex package, a more resilient design is to have a separate package which performs the geocoding against records which have already been processed by ETL and landed in the target database. The package will only geocode records which have yet to be geocoded, indicated by a null or flag. Once the geocoding is complete, the package updates the existing rows and clears the flag.

Because of the design, the geocoding package can be executed out-of-band, which handles the case of a network failure or even a historical reload of data.

Data Cleansing

Due to the nature of how address information is entered by operational systems, it is quite common for this information to contain errors (formatting, spelling, typos, etc). However, the SSIS Batch Geocoder will not perform any data cleansing of address information before submitting to Bing Maps. Since error-prone address information may yield poor or incorrect geocoding results, consideration for adding data cleansing practices to ETL processes should be made.

The EntityType, StatusCode and Confidence (FaultReason) values returned by Bing Maps can be used to identify records that had an issue being geocoded. A decision should be made for each possible return of whether the quality of the geocode is useful enough to be consumed by downstream systems and whether further address cleansing (i.e. manual intervention) is required.

Controlling the Geocode Dataflow Batch Size

For performance reasons, geocoding is performed in batch mode. The batch size is controlled by the SSIS buffer controls DefaultBufferMaxRows and DefaultBufferSize properties of a data flow task. Managing the batch size is useful for achieving the right balance between geocoding too much data at once (i.e. all is lost in case of failure, or Bing Maps limitations), and overall process performance.

Bing Maps Geocode Dataflow Limitations

See http://msdn.microsoft.com/en-us/library/ff701733.aspx.

Manual Installation

The source code project includes a Deploy.bat file containing the commands required to install the project's output DLL. This file is configured to run automatically as a post-build step. If you wish to install the DLL manually, the steps are:
  1. Compile the SSIS Batch Geocoder transformation component.
  2. Copy the transformation component (SSIS Batch Geocoder.dll) to the PipelineComponents folder for SSIS (i.e. "C:\Program Files\Microsoft SQL Server\100\DTS\PipelineComponents\"). On x64 platforms, also copy the file to the same location under c:\Program Files (x86).
  3. Add the assembly to the Global Assembly Cache (GAC). (See the article 'How to: Install an Assembly into the Global Assembly Cache' - http://msdn.microsoft.com/en-us/library/dkkx7f79.aspx, or see Deploy.bat.)

Note that if the component has been installed using the setup package, the gacutil -u command in Deploy.bat will fail. Uninstall the setup package before using Visual Studio to build (and install) the project, to avoid build errors.


Viewing all articles
Browse latest Browse all 17

Latest Images

Trending Articles





Latest Images