I created this small SQL Server function for summarizing patterns of data in fields. It comes in useful when the data is messy and needs to be standardized. For example, phone numbers can come in different formats: 123-123-1234, (123) 123–1234, (123) 1231234. The patternMask function can summarize the patterns of the values in the field.
It works by replacing lower case letters with a lower case 'x', upper case letters with an upper case 'X', numbers with a '#', and non-alphanumeric characters with a '*'. Alternatively, the non-alphanumeric characters can be returned as they are.
SELECT field1, [dbo].[patternMask](field1, 0) 'without', [dbo].[patternMask](field1, 1) 'with'
FROM table1
field1 | w/o non-alphanumeric mask | w non-alphanumeric mask |
---|---|---|
(()76&^%0(45eDD | (()##&^%#(##xXX | ***##***#*##xXX |
SELECT [dbo].[patternMask](field1, 0) Collate SQL_Latin1_General_CP1_CS_AS, COUNT(*)
FROM table1
GROUP BY [dbo].[patternMask](field1, 0) Collate SQL_Latin1_General_CP1_CS_AS