Friday, February 24, 2012

Brute-Force Substring Extraction

In my product description field, I need to extract the data that looks like
19-3-19 or 0-0-7. Sometimes that pattern is in the data, sometimes not.
I only want those that contain the pattern, of course, but I have no other method of determining whether or not the pattern will be in the description.
I thought I'd found a method that worked, until I encountered descriptions
that had a '-' in it prior to the pattern.

Here's what I have so far, it works for all but the last one.

CREATE TABLE #TMPPROD (
PROD_DESC VARCHAR(60)
)
INSERT INTO #TMPPROD
SELECT '22-4-12 15%SCU'
UNION ALL
SELECT '14-4-8 W/F'
UNION ALL
SELECT 'UFLEXX 17-3-17 2%FE'
UNION ALL
SELECT 'FERT-PEST 19-19-19'

SELECT PROD_DESC, SUBSTRING(PROD_DESC,CHARINDEX('-', PROD_DESC)-CASE WHEN PATINDEX('[0-9]', SUBSTRING(PROD_DESC, CHARINDEX('-', PROD_DESC) - 2,1))<>0 THEN 2
WHEN PATINDEX('[0-9]', SUBSTRING(PROD_DESC, CHARINDEX('-', PROD_DESC) - 1,1))<>0 THEN 1
ELSE 0 END, CHARINDEX('-', SUBSTRING(PROD_DESC,
CHARINDEX('-',PROD_DESC)+1,2))+CASE WHEN PATINDEX('[0-9]', SUBSTRING(PROD_DESC, CHARINDEX('-', PROD_DESC) - 2,1))<>0 THEN 2
WHEN PATINDEX('[0-9]', SUBSTRING(PROD_DESC, CHARINDEX('-', PROD_DESC) - 1,1))<>0 THEN 1
ELSE 0 END+2+CASE WHEN PATINDEX('[0-9]', SUBSTRING(PROD_DESC, CHARINDEX('-', PROD_DESC) +1,2))<>0 THEN 2
WHEN PATINDEX('[0-9]', SUBSTRING(PROD_DESC, CHARINDEX('-', PROD_DESC) - 1,1))<>0 THEN 1
ELSE 0 END) AS 'ANALYSIS'
FROM #TMPPROD
WHERE PROD_DESC LIKE '%-[0-9]%-[0-9]%'

DROP TABLE #TMPPROD

Any pointers on what to try next, or have I severely over-complicated the process - as I am wont to do.

Thanks
MarkSorry fella - I could not be bothered figuring out the DML.
When you say:
19-3-19 or 0-0-7
do you mean literals (i.e. these are the ONLY two values the field might contain) or these are examples? I suspect the latter since the former is trivial but to be sure...|||Yes, it's the latter. The numbers themselves can be anywhere between 0 and 62|||Just in case it helps, the statement above returns:

Prod_Desc...............Analysis
22-4-12 15%SCU........22-4-12
14-4-8 W/F.............14-4-8
UFLEXX 17-3-17 2%FE...17-3-17
FERT-PEST 19-19-19.....-P

That last one aint what I'm looking for|||Okey doo - ta :)

So - I don't think this is a problem:
I thought I'd found a method that worked, until I encountered descriptions
that had a '-' in it prior to the pattern.Unless the pattern can also be followed by digits and/ or "-".

Am I being a thicky pants?|||Unless the pattern can also be followed by digits and/ or "-".

Not sure I follow, but... Anything that follows the pattern is immaterial.
I tried some test data with an extra '-' farther down the line, and it still worked as expected.|||DECLARE @.TMPPROD TABLE (PROD_DESC VARCHAR(60))

INSERT @.TMPPROD
SELECT '22-4-12 15%SCU' UNION ALL
SELECT '14-4-8 W/F' UNION ALL
SELECT 'UFLEXX 17-3-17 2%FE' UNION ALL
SELECT 'No-valid-data 17-17 x' UNION ALL
SELECT 'FERT-PEST 19-19-19'

SELECT LEFT(d.q, CHARINDEX(' ', d.q + ' '))
FROM (
SELECT SUBSTRING(PROD_DESC, PATINDEX('%[0-9]%-%[0-9]-%[0-9]%', PROD_DESC), 60) AS q
FROM @.TMPPROD
) AS d
WHERE d.q LIKE '[0-9]%-%[0-9]-%[0-9]%'|||Ok, I'm going to preface this with the assertion that the code is uglier than a mud fence because it is doing a "brute force" dance around what I see as pattern matching limitations in the product. I also take a few liberties in accepting any one or two digits, not just 1 through 62, although you could also work around that easily enough if it was important to you.-- 20070828 ptp Find/return a CNO formatted substring inside an argument string

CREATE FUNCTION dbo.fCNO(@.pArg VARCHAR(8000))
RETURNS VARCHAR(9) AS
BEGIN
DECLARE
@.cResult VARCHAR(9)
, @.i INT

SET @.cResult = NULL

SET @.i = PatIndex('%[0-9]%-[0-9]%-[0-9]%', @.pArg)
WHILE @.cResult IS NULL AND 0 < @.i
BEGIN
SET @.pArg = SubString(@.pArg, @.i, 8000)
SET @.cResult =
CASE
WHEN @.pArg LIKE '[0-9][0-9]-[0-9][0-9]-[0-9][0-9]%' THEN SubString(@.pArg, 1, 8)
WHEN @.pArg LIKE '[0-9][0-9]-[0-9][0-9]-[0-9]%' THEN SubString(@.pArg, 1, 7)
WHEN @.pArg LIKE '[0-9][0-9]-[0-9]-[0-9][0-9]%' THEN SubString(@.pArg, 1, 7)
WHEN @.pArg LIKE '[0-9]-[0-9][0-9]-[0-9][0-9]%' THEN SubString(@.pArg, 1, 7)
WHEN @.pArg LIKE '[0-9][0-9]-[0-9]-[0-9]%' THEN SubString(@.pArg, 1, 6)
WHEN @.pArg LIKE '[0-9]-[0-9][0-9]-[0-9]%' THEN SubString(@.pArg, 1, 6)
WHEN @.pArg LIKE '[0-9]-[0-9]-[0-9]%' THEN SubString(@.pArg, 1, 5)
ELSE NULL
END

SET @.i = PatIndex('%[0-9]%-[0-9]%-[0-9]%', @.pArg)
END

RETURN @.cResult
END
GO

CREATE TABLE #TMPPROD (
PROD_DESC VARCHAR(60)
)
INSERT INTO #TMPPROD
SELECT '22-4-12 15%SCU'
UNION ALL
SELECT '14-4-8 W/F'
UNION ALL
SELECT 'UFLEXX 17-3-17 2%FE'
UNION ALL
SELECT 'FERT-PEST 19-19-19'

SELECT dbo.fCNO(PROD_DESC), PROD_DESC
FROM #TMPPROD
WHERE dbo.fCNO(PROD_DESC) IS NOT NULL

DROP TABLE #TMPPROD-PatP|||WOOHOO!
Two working solutions. :)
Thanks guys!|||Just be careful because Peso's solution runs quicker than mine does, but it also returns SSNs like 123-45-6789 and United States phone numbers like 312-555-1212. I read about a guy having that problem in this book once!

-PatP|||Yes, I also found it returns a few oddball items like "19-19-19F" or
if there is a series of 5 or more numbers prior to the pattern. That is
helpful in finding descriptions that need some maintenance, but ultimately
I don't want that showing up in my reports.|||-- Create and stage a permanent valid patterns table
CREATE TABLE #Patterns
(
Pattern VARCHAR(12) PRIMARY KEY CLUSTERED
)

INSERT #Patterns
(
Pattern
)
SELECT '% ' + CONVERT(VARCHAR(2), v1.Number) + '-' + CONVERT(VARCHAR(2), v2.Number) + '-' + CONVERT(VARCHAR(2), v3.Number) + ' %' AS x
FROM master..spt_values AS v1
CROSS JOIN master..spt_values AS v2
CROSS JOIN master..spt_values AS v3
WHERE v1.Type = 'p'
AND v1.Number BETWEEN 0 AND 62
AND v2.Type = 'p'
AND v2.Number BETWEEN 0 AND 62
AND v3.Type = 'p'
AND v3.Number BETWEEN 0 AND 62

-- Prepare sample data
DECLARE @.TMPPROD TABLE (PROD_DESC VARCHAR(60))

INSERT @.TMPPROD
SELECT '22-4-12 15%SCU' UNION ALL
SELECT '14-4-8 W/F' UNION ALL
SELECT 'UFLEXX 17-3-17 2%FE' UNION ALL
SELECT 'No-valid-data 17-17 x' UNION ALL
SELECT 'FERT-PEST 19-19-19'

-- Show the expected output
SELECT p.PROD_DESC,
PATINDEX(pat.Pattern, ' ' + p.PROD_DESC + ' ') AS Position
FROM #Patterns AS pat
INNER JOIN @.TMPPROD AS p ON ' ' + p.PROD_DESC + ' ' LIKE pat.Pattern|||That approach is better than the one that I posted with two caveats... First of all, it misses leading zeros which tend to appear relatively often in live CNO data although there aren't any in RedNeckGeeks sample data. As a side issue, I think that slows the selection process down quite a bit, so I think that will run a good bit slower.

-PatP|||Add this you your approach and you are more set!
2 combinations on three places gives 8 variations in total.

Here is the eight

WHEN @.pArg LIKE '[0-9]-[0-9][0-9]-[0-9]%' THEN SubString(@.pArg, 1, 6)|||As a side issue, I think that slows the selection process down quite a bit, so I think that will run a good bit slower.Yes, it is slower.

BUT.. It handles the 0-62 only condition...

No comments:

Post a Comment