Thursday | 10 OCT 2024
[ previous ]
[ next ]

Debugging a Node C Addon

Title:
Date: 2022-03-23
Tags:  

TLDR: strace, valgrind and gdb are your best friends

This will be a fun one. Today I'm going to outline the steps I took to debug my C addon. I haven't actually solved the issue yet so hopefully this will be a living document until I actually finish.

I'm working on an addon to be able to use a multivalue database, d3, from node and this involves wrapping the C code with the node addon api. I already did this for universe and now that I'm doing more work with D3, I thought it would be good to do this for D3 as well.

In a previous post, I created a basic example of the C code to communicate with D3 and then I started working on the node addon. This however led to a strange error right at the beginning. My addon would hang immediately at the log in of D3.

The first thing I did was use strace and I did strace node test.js. Inside test.js was simply an import of my addon and the startsession function.

--- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0x8} ---

The strace showed the above line at the end over and over again, this means that it was definitely a seg fault but I didn't know why it was going forever.

The next thing I did was use valgrind. This was also quite easy, I could simple do valgrind node test.js and away we go!

Now I got much more helpful errors.

[root@d3linux cdemo]# valgrind ./cdemo bp color
==48003== Memcheck, a memory error detector
==48003== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==48003== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
==48003== Command: ./cdemo bp color
==48003==
==48003== Warning: client syscall shmat tried to modify addresses 0x40000000-0xf00a4fff
WARNING: Buffer attachment failed. Error 22
==48003== Invalid read of size 8
==48003== at 0x41E5B4: pick_init_bs_attach (in /home/nivethan/bp/cdemo/cdemo)
==48003== by 0x41A72B: pick_init (in /home/nivethan/bp/cdemo/cdemo)
==48003== by 0x47F768: _CP_logon (in /home/nivethan/bp/cdemo/cdemo)
==48003== by 0x406109: main (in /home/nivethan/bp/cdemo/cdemo)
==48003== Address 0x0 is not stack'd, malloc'd or (recently) free'd
==48003==
==48003==
==48003== Process terminating with default action of signal 11 (SIGSEGV)
==48003== Access not within mapped region at address 0x0
==48003== at 0x41E5B4: pick_init_bs_attach (in /home/nivethan/bp/cdemo/cdemo)
==48003== by 0x41A72B: pick_init (in /home/nivethan/bp/cdemo/cdemo)
==48003== by 0x47F768: _CP_logon (in /home/nivethan/bp/cdemo/cdemo)
==48003== by 0x406109: main (in /home/nivethan/bp/cdemo/cdemo)
==48003== If you believe this happened as a result of a stack
==48003== overflow in your program's main thread (unlikely but
==48003== possible), you can try to increase the size of the
==48003== main thread stack using the --main-stacksize= flag.

==48003== The main thread stack size used in this run was 8388608.
==48003==
==48003== HEAP SUMMARY:
==48003== in use at exit: 170 bytes in 6 blocks
==48003== total heap usage: 72 allocs, 66 frees, 14,471 bytes allocated
==48003==
==48003== LEAK SUMMARY:
==48003== definitely lost: 0 bytes in 0 blocks
==48003== indirectly lost: 0 bytes in 0 blocks
==48003== possibly lost: 0 bytes in 0 blocks
==48003== still reachable: 170 bytes in 6 blocks
==48003== suppressed: 0 bytes in 0 blocks
==48003== Rerun with --leak-check=full to see details of leaked memory
==48003==
==48003== For counts of detected and suppressed errors, rerun with: -v
==48003== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)
Segmentation fault
[root@d3linux cdemo]#

It looks like somewhere deeper down than my code, in the D3 library code, there is some sort of memory bug. I'm not sure what yet.

Next up, GDB.

> gdb --args node test.js

Now I could single step through my node addon.

I did need to add a break point manually in the addon so that it raises a SIGINT which gdb could then catch,

#include <signal.h>
...
raise(SIGINT)
...

You can place the raise anywhere you want to slowdown and start stepping through. You could also set the break point by specifying the line number but I couldn't figure out how say which file I wanted my break point to be in.

I used the si command to go instruction by instruction and I can see part of the library code but it wasn't much. I could see function names and that was pretty much it.

gdb didn't really help but it was good to know that I could use it pretty easily with node.

At this point I was pretty confident that this is an issue where I need help from the vendor and so I made a forum post which we then moved to a ticket. I was able to create a test case of just a C file and valgrind that showed what I thought was the problem and sent that to the vendor.

This went well as they immediately told me that the first error I'm seeing is actually because valgrind is trying to use the same memory space as D3. It seems D3 has hardcoded memory locations and so the shared memory is an issue. This is probably true for node as well.

The solution was to force valgrind to not use that memory space.

After poking around the documentation, I found an option to force valgrind to use a different memory space.

> valgrind --aspace-minaddr=0x90000000 node test.js

That worked perfectly as that changed my earlier error to something that a forum post had actually already mentioned that I had read.

sudo valgrind --aspace-minaddr=0x90000000 ./test
==3139== Memcheck, a memory error detector
==3139== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==3139== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
==3139== Command: ./test
==3139==
==3139== Invalid read of size 1
==3139== at 0x90C2CA97: strcpy (vg_replace_strmem.c:510)
==3139== by 0x4182E4: pick_init_termio (in /home/nivethan/bp/d3c/test)
==3139== by 0x417D4F: pick_init (in /home/nivethan/bp/d3c/test)
==3139== by 0x472C88: _CP_logon (in /home/nivethan/bp/d3c/test)
==3139== by 0x405F90: main (http://test.cc:12)
==3139== Address 0xffffffff95fffc50 is not stack'd, malloc'd or (recently) free'd
==3139==
==3139==
==3139== Process terminating with default action of signal 11 (SIGSEGV)
==3139== Access not within mapped region at address 0xFFFFFFFF95FFFC50
==3139== at 0x90C2CA97: strcpy (vg_replace_strmem.c:510)
==3139== by 0x4182E4: pick_init_termio (in /home/nivethan/bp/d3c/test)
==3139== by 0x417D4F: pick_init (in /home/nivethan/bp/d3c/test)
==3139== by 0x472C88: _CP_logon (in /home/nivethan/bp/d3c/test)
==3139== by 0x405F90: main (http://test.cc:12)
==3139== If you believe this happened as a result of a stack
==3139== overflow in your program's main thread (unlikely but
==3139== possible), you can try to increase the size of the
==3139== main thread stack using the --main-stacksize= flag.
==3139== The main thread stack size used in this run was 8388608.
==3139==
==3139== HEAP SUMMARY:
==3139== in use at exit: 114 bytes in 5 blocks
==3139== total heap usage: 71 allocs, 66 frees, 14,415 bytes allocated
==3139==
==3139== LEAK SUMMARY:
==3139== definitely lost: 0 bytes in 0 blocks
==3139== indirectly lost: 0 bytes in 0 blocks
==3139== possibly lost: 0 bytes in 0 blocks
==3139== still reachable: 114 bytes in 5 blocks
==3139== suppressed: 0 bytes in 0 blocks
==3139== Rerun with --leak-check=full to see details of leaked memory
==3139==
==3139== For counts of detected and suppressed errors, rerun with: -v
==3139== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)
/var/tmp/sclLSBpCO: line 8: 3139 Segmentation fault 'valgrind' '--aspace-minaddr=0x90000000' './test'

Now there seems to be an issue in strcpy which I'm waiting to hear back about. The forum post unfortunately was moved to a ticket and was never answered. I'll need to make sure to go back and explain what the issue is.

Overall I'm having quite a bit of fun debugging this issue and I'm far more comfortable using valgrind and gdb now. I would love to be able to go all the way but unfortunately the library code I'm relying on is something I don't have access to.