Extracting a substring from a TEXT field in MySQL
The following SQL snippet extracts the text between <H3> and </H3> in the raw column:
Actually, it only does so when </H3> (or any end string you’re looking for) only occurs once in `raw`. I’ve created a MySQL function to fix this. The function takes 3 parameters:
- Source: the column in which to look for the snippet
- StartTag: the string delimiting the start of what we want
- EndTag: the string delimiting the end of what we want
The function
DROP FUNCTION IF EXISTS `ExtractString` $$
CREATE DEFINER=`root`@`localhost` FUNCTION `ExtractString`(Source text CHARSET utf8, StartTag varchar(1000), EndTag varchar(1000)) RETURNS text CHARSET utf8
BEGIN
DECLARE temp text CHARSET utf8;
IF (LOCATE(StartTag, Source) > 0) THEN
SET temp = SUBSTR(Source, LOCATE(StartTag, Source) + LENGTH(StartTag));
RETURN SUBSTR(temp, 1, LOCATE(EndTag, temp)-1);
ELSE
RETURN NULL;
END IF;
END $$
DELIMITER ;
Usage
Suppose we have a table with a field `raw` and we want to extract the string between ‘##start##’ and ‘##end##’. The following SQL statement would give us this for every row in our table:
SELECT id, ExtractString(raw, '##start##', '##end##') FROM table;
I’m using UTF-8 encoding in my database. The ExtractString function also uses UTF-8. If you are using a different encoding (Latin-1 is the default in MySQL), then alter the function to match your data. If the encoding does not match, StartTag and EndTag string might not match as you expect.
Cheers!
I am a .NET programmer first and foremost. But in my spare time I like to play around with PHP, Erlang, Haskell, F#,