Thursday | 10 OCT 2024
[ previous ]
[ next ]

Rob Pike's Regex in BASIC

Title:
Date: 2023-04-26
Tags:  regex, BASIC

I want to write a GREP function for BASIC, there is a SEARCH keyword but it doesn't let you do wildcards and so it is quite limited. I think having a regex search would be helpful, especially when I usually already know what I'm looking for.

I translated Rob Pike's famous example of a regex to BASIC and it was pretty straightforward. The most frustrating thing however is that I don't understand it intuively. The translation looks good and seems to be worknig but the recursive elements are escaping me even now. I may have gotten dumber!

Here is Brian Kernighan's Exegesis which shows and explains Rob Pike's regular expression matcher. I want to understand this truly at some point, I want to be able to re-invent it now that I know it exists.

https://www.cs.princeton.edu/courses/archive/spr09/cos333/beautiful.html

A possible future goal would be to see if I can collapse this recursive element into a single file so that I can use it in my editor. Currently I do straight string matches but it would be cool to be able to do a regex from with in my editor.

This the the MATCH function. This is what starts of the regex routines.

    SUBROUTINE MATCH(REGEXP,TEXT,RESULT)
*
    IF REGEXP[1,1] = '^' THEN
        CALL MATCH.HERE(REGEXP[2,999],TEXT,RESULT)
        RETURN
    END
*
    TEXT.LEN = LEN(TEXT)
*
    FOR I = 1 TO TEXT.LEN
        CALL MATCH.HERE(REGEXP,TEXT[I,999],MH.RESULT)
*
        IF MH.RESULT = 1 THEN
            RESULT = 1
            RETURN
        END
    NEXT I
*
    RESULT = 0
*
    RETURN
*
* END OF PROGRAM
*
    END
*

This is the MATCH.HERE function and this is the core engine of the regex logic.


    SUBROUTINE MATCH.HERE(REGEXP,TEXT,RESULT)
*
    IF REGEXP[1,1] = '' THEN
        RESULT = 1
        RETURN
    END
*
    IF REGEXP[2,1] = '*' THEN
        CALL MATCH.STAR(REGEXP[1,1],REGEXP[3,999],TEXT,RESULT)
        RETURN
    END
    
*
    IF REGEXP[1,1] = '$' AND LEN(REGEXP) = 1 THEN
        IF TEXT = '' THEN
            RESULT = 1
            RETURN
        END ELSE
            RESULT = 0
            RETURN
        END
    END
*
    IF TEXT # '' AND (REGEXP[1,1] = '.' OR REGEXP[1,1] = TEXT[1,1]) THEN 
        CALL MATCH.HERE(REGEXP[2,999],TEXT[2,999],RESULT)
        RETURN
    END
*
    RESULT = 0
*
    RETURN
*
* END OF PROGRAM
*
    END
*

Lastly we have the MATCH.STAR function to handle the wildcards in regex.

    SUBROUTINE MATCH.STAR(C,REGEXP,TEXT,RESULT)
*
    TEXT.LEN = LEN(TEXT)
*
    FOR I = 1 TO TEXT.LEN
        CALL MATCH.HERE(REGEXP,TEXT[I,999],MH.RESULT)
*
        IF MH.RESULT = 1 THEN
            RESULT = 1
            RETURN
        END
*
        IF TEXT[I,1] # C AND C # '.' THEN
            EXIT
        END
    NEXT I
*
    RESULT = 0
*
    RETURN
*
* END OF PROGAM
*
    END
*

I wired up the matcher to my GREP program and everything works! Too bad it is stupid slow. The recursive calls are taking too much time from a preliminary test and I'll need to figure out why. It might be that calling subroutines is pretty expensive.

I wired up the matcher in an internal subroutine so I could quickly switch between using the INDEX function and compare it to the matcher function. The difference is huge. I think an improvement will be to remove the recursve aspect and see if I can do it all within this one program.

*
    GIT.FILENAME = 'GREP'
    GIT.REPO = 'https://github.com/Krowemoh/TCL-Utilities.git'
*
* VERSION
*
    VERSION = '1'
*
    @USER1 = 'GREP'
    @USER2 = 'GREP'
*
    EQU TRUE TO 1
    EQU FALSE TO 0
*
    CALL GET.ARGUMENTS(ARGUMENTS)
*
    DICT = ''
*
    IF ARGUMENTS<2> = 'DICT' THEN
        ARGUMENTS = DELETE(ARGUMENTS,2)
        DICT = 'DICT'
    END
*
    ARGS.LEN = DCOUNT(ARGUMENTS,@AM)
*
    IF ARGS.LEN = 1 THEN
        PRINT 'GREP - Regular Expression Search'
        PRINT
        PRINT 'GREP [DICT] {FILE} {RECORD.ID} {REGEX}'
        PRINT
        STOP
    END
*
    IF ARGS.LEN = 2 THEN
        PRINT 'Invalid number of args.'
        STOP
    END
*
    FILE.NAME = ARGUMENTS<2>
*
    OPEN DICT,FILE.NAME TO FILE ELSE
        PRINT 'Unable to open file: ' : FILE.NAME
        STOP
    END
*
    IF ARGS.LEN = 3 THEN
        REGEXP = ARGUMENTS<3>
        SELECT FILE
*
    END ELSE IF ARGS.LEN = 4 THEN
        RECORD.ID = ARGUMENTS<3>
        REGEXP = ARGUMENTS<4>
        SELECT RECORD.ID
    END
*
    IF REGEXP[1,1] = '"' AND REGEXP[LEN(REGEXP),1] = '"' THEN
        REGEXP = REGEXP[2,LEN(REGEXP)-2]
    END
*
    DIM CONTENT(10000)
    MAT CONTENT = ''
*
    LOOP
        READNEXT ITEM.ID ELSE ITEM.ID = ''
    UNTIL ITEM.ID = '' DO
        MATREAD CONTENT FROM FILE,ITEM.ID ELSE MAT CONTENT = ''
        NUMBER.OF.LINES = INMAT()
*
        FOR I = 1 TO NUMBER.OF.LINES
*
            TEXT = CONTENT(I)
*
            GOSUB MATCH.INDEX
*
            IF RESULT THEN
                PRINT ITEM.ID : ': ' : I : ': ' : TEXT
            END
        NEXT I
    REPEAT
*
*********************  S U B R O U T I N E  *********************
*
MATCH.INDEX:NULL
*
    RESULT = FALSE
*
    IF INDEX(TEXT,REGEXP,1) # 0 THEN
        RESULT = TRUE
    END
*
    RETURN
*
*********************  S U B R O U T I N E  *********************
*
MATCH.RECURSE:NULL
*
    RESULT = FALSE
*
    CALL MATCH(REGEXP,TEXT,RESULT)
*
    RETURN
*
* END OF PROGRAM
*
    END
*