1 Fuzzy Grouping
Fuzzy grouping allows you to normalize names within a large list of vendors. The steps below will walk you through set up for Fuzzy Grouping G rouping in Visual Studio 2008. Open Visual Studio 2008. Click on new project.
Choose “New Integration Services Project.” Project.” Once you have the interface open you are ready to begin a new data flow. Drag in your source (in this particular example my data came from EXCEL, but on can also pull data data from ADO, OLE. XML, etc.) Also pull in the transformation you would like (in this case we are using fuzzy grouping), and the destination to which you want your data to go after the transformation.
2 Once you have dragged your source, a transformation, and a destination into the data flow, connect the boxes by dragging the arrow from one box to the other as demonstrated below. Right click the source button (in this example, Excel Source) to open up the Source Editor. Define your source by clicking the “New” Tab and choose the appropriate excel sheet from which the data you are looking to normalize is kept (if the data is in sql, obviously choose the table, etc). In the “columns” tab at the left of the screenshot below, choose the column from which you want to pull the data you would like to normalize. In this example, I selected the excel file “Fuzzy_Grouping” and the spreadsheet “Vendors.” The column name is “Vendor Name.” I also selected vendor code to be included in the output, t hough it won’t be used in the fuzzy grouping transformation.
3 The next step is to define your transformation. Right click the transformation button in your data and choose “Edit.” The chart below will pop up. Click “New” and choose the db that will create your temporary table (any OLE DB connection will work—this example runs off my local sql db (JGUBA01PC), but I could have run it off wwwdasql05 if wanted to as well).
After choosing a connection, click the columns tab. Select the column of data on which you want to run the fuzzy grouping. In this case, check the vendor name box (diagram below). (Hint: Check the box in the column screen that says “Pass Through” for any columns (like vendor code) on which you do NOT want a fuzzy grouping). Click the advanced box to change the degree of accuracy you would like for your grouping. The default setting is 80% but I have found that 80% excludes too many possible matches. 60% match works well.
4
Right click your destination and pick the location for the output. In this case, I am choosing my spreadsheet “Fuzzy Grouping” as my OLE Db. Click “New” to define the output excel spreadsheet and accept Visual Studios prefabricated make table query. I changed the name from “Excel Destination” to “Sheet_1” in the spreadsheet. Be sure to hit the “Mappings” button in the left handcolumn and map the data to the tool’s prefabricated output columns. These columns will appear in “Sheet 1” of the example spreadsheet.
Once you have all three boxes in the data flow defined to a source, a transformation, and a destination, push the green button in the toolbar to run the transformation.
5
If your data flow is constructed properly, all three boxes will turn green.
Your output should be in the excel file (or o ther destination, sql db, etc.) that you defined as your destination. Each entry in the original data set is defined as unique (key in) and matched to data within the data set (key out). A similarity index is generated in the rightmost column by which the user can judge potential matches . I suggest adding another column in excel where the user determines whether the match is correct (I added a column in the accompanying spreadsheet called “new vendor code” --if a match is bad, simply delete the ne w match and return the old vendor code to the “new vendor” column. New mappings (along with a vendor code) can be uploaded to sql as new normalized vendor names.