programming.torensma.net: Code Snippets

Extracting a substring from a TEXT field in MySQL

The following SQL snippet extracts the text between <H3> and </H3> in the raw column:

SELECT mid(LEFT(raw, locate('</H3>', raw)-1), locate('<H3>', raw)+4) FROM TABLE


Actually, it only does so when </H3> (or any end string you’re looking for) only occurs once in `raw`. I’ve created a MySQL function to fix this. The function takes 3 parameters:

  • Source: the column in which to look for the snippet
  • StartTag: the string delimiting the start of what we want
  • EndTag: the string delimiting the end of what we want

The function

DELIMITER $$

DROP FUNCTION IF EXISTS `ExtractString` $$
CREATE DEFINER=`root`@`localhost` FUNCTION `ExtractString`(Source text CHARSET utf8, StartTag varchar(1000), EndTag varchar(1000)) RETURNS text CHARSET utf8
BEGIN

  DECLARE temp text CHARSET utf8;

  IF (LOCATE(StartTag, Source) &gt; 0) THEN
    SET temp = SUBSTR(Source, LOCATE(StartTag, Source) + LENGTH(StartTag));
    RETURN SUBSTR(temp, 1, LOCATE(EndTag, temp)-1);
  ELSE
    RETURN NULL;
  END IF;

END $$

DELIMITER ;

Usage

Suppose we have a table with a field `raw` and we want to extract the string between ‘##start##’ and ‘##end##’. The following SQL statement would give us this for every row in our table:

SELECT id, ExtractString(raw, '##start##', '##end##') FROM table;


I’m using UTF-8 encoding in my database. The ExtractString function also uses UTF-8. If you are using a different encoding (Latin-1 is the default in MySQL), then alter the function to match your data. If the encoding does not match, StartTag and EndTag string might not match as you expect.

Cheers!

You can follow any responses to this entry through the RSS 2.0 feed.