Multithreading BASIC

Date: 2023-04-24

The Spookiest Problem

My tech stack takes 75 seconds to generate roughly 430 blog posts, which seems like an extremely long time, especially when I want to generate and view my site immediately. One way to speed up site generation would be to cache the md5 sum of each file so that I can check whether it has changed and only call pandoc when there is a change.
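A minimal sketch of the caching idea, written in Python rather than BASIC purely for illustration. The `changed_posts` helper, the JSON cache file, and keying the cache by file path are all assumptions of the sketch, not something my generator does today. Only the posts it returns would need a pandoc run.

```python
import hashlib
import json
from pathlib import Path


def md5_of(path: Path) -> str:
    """Return the hex MD5 digest of a file's contents."""
    return hashlib.md5(path.read_bytes()).hexdigest()


def changed_posts(posts: list[Path], cache_file: Path) -> list[Path]:
    """Return posts whose digest differs from the cached one, then update the cache."""
    # Hypothetical cache format: {"path/to/post.md": "<md5 hex digest>", ...}
    cache = json.loads(cache_file.read_text()) if cache_file.exists() else {}
    changed = []
    for post in posts:
        digest = md5_of(post)
        if cache.get(str(post)) != digest:
            changed.append(post)          # new or modified, so it needs rendering
            cache[str(post)] = digest
    cache_file.write_text(json.dumps(cache))
    return changed
```

On the first run every post counts as changed; on later runs only edited or new posts come back.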

The other solution is to split the posts into multiple lists and render them all at once. This would involve using the PHANTOM command to run each list as a background job and keeping track of the jobs with a counter in a control file. When a job starts, the counter is locked, incremented, and released; when the job is over, it is locked and decremented. The parent process can then loop, reading that control file record until the counter is equal to 0.
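The counter protocol can be sketched with threads standing in for phantoms and an in-memory counter standing in for the control file record. This is Python for illustration only, and the names (`Counter`, `run_jobs`) are hypothetical. One deliberate deviation from the description: the sketch increments the counter in the parent just before each worker is launched rather than inside the worker, so the parent cannot read 0 before a worker has registered itself.

```python
import threading
import time


class Counter:
    """In-memory stand-in for a locked counter record."""

    def __init__(self) -> None:
        self._value = 0
        self._lock = threading.Lock()

    def adjust(self, delta: int) -> None:
        with self._lock:              # lock, update, release
            self._value += delta

    def read(self) -> int:
        with self._lock:
            return self._value


def run_jobs(chunks: list[list[str]], poll_seconds: float = 0.01) -> list[str]:
    """Run one worker per chunk and poll the counter until every worker is done."""
    counter = Counter()
    rendered: list[str] = []
    rendered_lock = threading.Lock()

    def job(post_ids: list[str]) -> None:
        try:
            for post_id in post_ids:
                with rendered_lock:
                    rendered.append(post_id)   # pretend to render the post
        finally:
            counter.adjust(-1)                 # job over: decrement

    for chunk in chunks:
        counter.adjust(+1)                     # counted before launch (see note above)
        threading.Thread(target=job, args=(chunk,)).start()

    while counter.read() != 0:                 # parent waits for the counter to hit 0
        time.sleep(poll_seconds)
    return rendered
```

Calling `run_jobs([["a", "b"], ["c"]])` returns once the counter is back at 0, with all three ids in the result (in whatever order the workers got to them).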

This isn't the optimal solution but it would work and I think it would be a good learning experience even if the caching solution is the ultimate answer.

A Solution - PHANTOM Jobs

The first step is to add a BLOG.CTR record to the CONTROL-FILE, then read it in and write it back out in my GENERATE.POSTS subroutine. This is what I will use to keep track of running processes. Once BLOG.CTR hits 0, there are no jobs currently running.

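* Lock the counter record and add one. The WRITE releases the lock.
    READU BLOG.CTR FROM CONTROL.FILE,'BLOG.CTR' THEN
        WRITE BLOG.CTR+1 ON CONTROL.FILE,'BLOG.CTR'
    END ELSE
        WRITE 1 ON CONTROL.FILE,'BLOG.CTR'
    END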

This is at the start of a job, so we increment the counter. We will then decrement the counter once the subroutine is ready to exit.

The next step is to split my list into multiple chunks. PICK unfortunately doesn't have the concept of doing a select and paging through the results. There is no LIMIT keyword, so we will need to wire it up manually. This means that a select is done and then the ids are chunked in code.

I currently do:

    EXECUTE 'SSELECT BLOG-POSTS BY-DSND DATE BY-DSND SEQ'
    EXECUTE 'RUN BP GENERATE.POSTS' PASSLIST

This results in 433 records getting selected and then I use pandoc to generate each post sequentially.

To chunk the active list:

* Select the posts; the ids end up on the active select list
    EXECUTE 'SSELECT DG BY-DSND DATE BY-DSND SEQ' CAPTURING RESULTS
*
    NUMBER.OF.THREADS = 5
*
    JOBS = ''
    TOTAL.JOBS = @SELECTED
* Number of ids each phantom is responsible for
    JOBS.PER.THREAD = INT(TOTAL.JOBS / (NUMBER.OF.THREADS-1)+0.5)
*
    CTR = 0
    JOB.CTR = 1
    DONE = FALSE
* Walk the active list, moving to the next chunk every JOBS.PER.THREAD ids
    LOOP
        READNEXT ITEM.ID ELSE DONE = TRUE
    UNTIL DONE DO
        JOBS<JOB.CTR,-1> = ITEM.ID
        CTR = CTR + 1
*
        IF CTR = (JOBS.PER.THREAD * JOB.CTR) THEN
            JOB.CTR = JOB.CTR + 1
        END
    REPEAT
* Print the size of each chunk as a sanity check
    FOR JOB.CTR = 1 TO DCOUNT(JOBS,@AM)
        PRINT DCOUNT(JOBS<JOB.CTR>,@VM)
    NEXT JOB.CTR
*

I set up the number of threads I want to use. In this case, we are going to try 5. Next I set up the JOBS variable, which will contain an attribute-mark-delimited list of value-mark-delimited lists of ids. This means that JOBS will have 5 lists inside of it.

I get the total number of ids that are currently selected. That will be the number of jobs I need to process. Now we can calculate the number of jobs that each thread will be responsible for.

I then loop through the active list and chunk the list accordingly.
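To check the arithmetic, here is a small Python sketch that mirrors the per-thread formula and the chunking loop, next to a plain ceiling-division split for comparison. It assumes 433 selected ids and 5 threads, as in this post. `chunk_like_basic` and `chunk_evenly` are hypothetical names, and the even split is an alternative I did not use, shown only to make the difference visible.

```python
def chunk_like_basic(ids: list[str], threads: int) -> list[list[str]]:
    """Mirror the BASIC loop: size = INT(total / (threads - 1) + 0.5)."""
    per_thread = int(len(ids) / (threads - 1) + 0.5)   # requires threads >= 2
    jobs: dict[int, list[str]] = {}
    job_ctr = 1
    for ctr, item_id in enumerate(ids, start=1):
        jobs.setdefault(job_ctr, []).append(item_id)
        if ctr == per_thread * job_ctr:                # chunk is full, move on
            job_ctr += 1
    return [jobs[key] for key in sorted(jobs)]


def chunk_evenly(ids: list[str], threads: int) -> list[list[str]]:
    """Alternative: ceiling division so no chunk is more than one id off the rest."""
    if not ids:
        return []
    per_thread = -(-len(ids) // threads)               # ceil(len / threads)
    return [ids[i:i + per_thread] for i in range(0, len(ids), per_thread)]


ids = [f"POST{n}" for n in range(1, 434)]              # 433 ids, as selected above
print([len(chunk) for chunk in chunk_like_basic(ids, 5)])  # [108, 108, 108, 108, 1]
print([len(chunk) for chunk in chunk_evenly(ids, 5)])      # [87, 87, 87, 87, 85]
```

If the select really returns 433 ids, dividing by `NUMBER.OF.THREADS-1` leaves the fifth list with a single id, so the work is effectively spread across four phantoms rather than five.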

At the end of the chunking code, I have a jobs list that I can use to spin up phantoms.

*
    FOR JOB.CTR = 1 TO NUMBER.OF.THREADS
        CLEARSELECT
        WORK = RAISE(JOBS<JOB.CTR>)
        SELECT WORK
        PRINT @SELECTED
        EXECUTE "PHANTOM GENERATE.POSTS" PASSLIST
    NEXT JOB.CTR
*

Looping from 1 to the number of threads, I select each set of jobs in turn. This is the work that each phantom will do. I then execute a phantom job. GENERATE.POSTS is a BASIC program that is cataloged so it can be run directly from TCL.

Unfortunately, it looks like you can't pass an active select list to a phantom this way. This means that I need to write each list out to the &SAVEDLISTS& file and then pick it up in GENERATE.POSTS.

*
    LIST.NAMES = ''
*
    FOR JOB.CTR = 1 TO NUMBER.OF.THREADS
        CLEARSELECT
*
* Include the loop counter so every list id is unique within this run
        LIST.ID = TIME() : '.' : JOB.CTR
        LIST.NAMES<-1> = LIST.ID
*
        WORK = RAISE(JOBS<JOB.CTR>)
        WRITE WORK ON SAVED.LISTS,LIST.ID
*
        EXECUTE 'PHANTOM GENERATE.POSTS ' : LIST.ID
    NEXT JOB.CTR
*

I now write out each set of jobs to &SAVEDLISTS&, keep track of the list id, and pass that id to the GENERATE.POSTS program. Inside the program I do a GET-LIST with the LIST.ID.
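The handoff itself is simple: write each chunk under a name that is unique within the run, pass only the name to the worker, and have the worker load the ids back. A rough Python sketch of that shape, with a directory standing in for &SAVEDLISTS&. The `save_lists` and `get_list` helpers and the `stamp.index` naming are assumptions of the sketch.

```python
from pathlib import Path


def save_lists(chunks: list[list[str]], directory: Path, stamp: str) -> list[str]:
    """Write one file per chunk, one id per line, and return the list names."""
    names = []
    for index, chunk in enumerate(chunks, start=1):
        name = f"{stamp}.{index}"              # index keeps names distinct in one run
        (directory / name).write_text("\n".join(chunk))
        names.append(name)
    return names


def get_list(directory: Path, name: str) -> list[str]:
    """What a worker does with the name it was handed: load the ids back."""
    return (directory / name).read_text().splitlines()
```

The parent keeps the returned names so it can delete the files once every worker has finished.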

This works pretty well now. The last step is to monitor BLOG.CTR in the CONTROL-FILE so that I can tell when the posts are all generated. Once everything is generated, I can also delete the saved lists that I created.

*
    DONE  = FALSE
*
    LOOP UNTIL DONE DO
        READ BLOG.CTR FROM CONTROL.FILE,'BLOG.CTR' ELSE DONE = TRUE
*
        IF BLOG.CTR # 0 THEN
            SLEEP 2
        END ELSE
            DONE = TRUE
        END
    REPEAT
*
    FOR I = 1 TO DCOUNT(LIST.NAMES,@AM)
        DELETE SAVED.LISTS,LIST.NAMES<I>
    NEXT I
*

I check the CONTROL.FILE record in a loop to see if BLOG.CTR has hit 0. Once it does, I can exit the loop and delete the saved lists.

Voila! I have now finished multithreading my blog generation. Too bad that it still runs terribly. I think something is fundamentally wrong in my implementation, as a thread count of 5 only cut the time in half. I was expecting a much bigger drop. Spinning up 40 phantoms brought it down to 13 seconds, which is still ridiculously slow.

It's likely there is something I have misunderstood about phantoms. I'll leave that for another day though. For now I have something :)