• Silas Ray

blog.silas.nyc

  • Adding Directionality

    March 20th, 2025

    Today I spent time working on adding structure to the network. I’m abstracting Clusters and Neurons to a NetworkNode base type, then I’ll create another derived type, Layer. A Network will be composed of NetworkNodes, mostly Layers. Layers will be composed of Clusters, and Clusters of Neurons. Clusters will still interlink internally in random directions, but Layers will only link in one direction.
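
    To make the planned hierarchy concrete, here’s a minimal sketch of how the types might relate. The class names follow this post, but the fields, constructors, and linking logic are my own illustrative assumptions, not the actual code in the repo.

    ```python
    # Hypothetical sketch of the planned hierarchy; field names are assumptions.
    import random


    class NetworkNode:
        """Common base for anything that can be linked into the network."""

        def __init__(self):
            self.outgoing = []  # downstream NetworkNodes this node links to

        def link_to(self, other):
            self.outgoing.append(other)


    class Neuron(NetworkNode):
        pass


    class Cluster(NetworkNode):
        """A bundle of Neurons that interlink internally in random directions."""

        def __init__(self, size):
            super().__init__()
            self.neurons = [Neuron() for _ in range(size)]
            for neuron in self.neurons:
                target = random.choice(self.neurons)
                if target is not neuron:
                    neuron.link_to(target)  # random internal direction


    class Layer(NetworkNode):
        """A row of Clusters that only links forward, toward the next Layer."""

        def __init__(self, clusters):
            super().__init__()
            self.clusters = clusters


    class Network:
        """Composed of NetworkNodes, mostly Layers, linked in one direction."""

        def __init__(self, layers):
            self.layers = layers
            for upstream, downstream in zip(layers, layers[1:]):
                upstream.link_to(downstream)  # forward-only links between layers
    ```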

    I started putting together NetworkNode and reworking Neuron and Cluster, but it’s not even in a state worth checking in as a WIP yet. I’ll have to keep going on this for a few days, I think.

  • Some Thoughts on Test Strategy

    March 19th, 2025

    At my last job, I found myself in the position of working on testing and test automation again, and the experience got me thinking about how to improve the process and prioritization of building and maintaining a testing effort. I wanted to capture that somewhere before it completely leaves my head, so I thought I’d put it here.

    There are two main areas that I came up with to frame the exploration of this topic. First, value: how do tests deliver value to the development team and the business? Second, cost: in what ways does building and running tests cost the team? All of this is framed with an eye toward DORA alignment, both looking at the testing effort as a self-standing development process and in terms of how it ties into those metrics for the product, and where it touches on the points of value and cost. Finally, the metrics are designed with a bias toward concrete data points that can be measured with data available from existing tools, like ticketing systems and test result records, so the regime can be used practically to weigh priorities when deciding what to write, what to improve, and what to retire.

    Value

    The core factor in how tests deliver value is in giving confidence. Confidence that if the tests pass, the system works, and if they don’t, the system is broken. Confidence that if it is broken, they tell us how, and how to fix it. From that premise, there are some metrics to be derived. These metrics apply to automated testing, but also to manual test documentation.

    in service ratio – Out of every opportunity to generate a result from a test (roughly speaking, a test execution), what is the ratio of executions that generate a signal on the state of the product vs. those that don’t? This can further be broken down into explicit (a test is taken out of service for maintenance, or generates a test failure) and implicit (a test is too hard, flaky, or long-running to run frequently, or there’s an environment or data problem that precludes it running, etc.). This should be collectable from test result data. Version control data for the test repo may also be useful. A rough computation of this and the next metric is sketched after these value metrics.

    product failure rate – Out of every potential execution of the test, how often does it surface a product issue, including cases where it is put out of service pending a product fix? Remember, this is a measure of the value of the tests, not of the product. It could also be weighted by the severity and/or priority of the issues identified. This should be collectable from ticketing/work management systems, with ticket linking and maybe a custom field or two on tickets.

    target functionality value – How important is the functionality the test targets to the business? This can be because it is critical path, or because it blocks a lot of other functionality/tests. This would have to be judged by the team, with some way of maintaining consistent meaning for ranks/scores.
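
    To make the first two value metrics a little more concrete, here’s a minimal sketch of how they might be computed from test result records. The record schema and status names are assumptions for illustration, not tied to any particular tool.

    ```python
    # Hypothetical test result records; the schema is an assumption for illustration.
    results = [
        {"test": "checkout_happy_path", "status": "pass"},
        {"test": "checkout_happy_path", "status": "product_failure"},
        {"test": "checkout_happy_path", "status": "skipped_maintenance"},
        {"test": "checkout_happy_path", "status": "pass"},
    ]

    # Statuses that actually produced a signal about the state of the product.
    SIGNAL_STATUSES = {"pass", "product_failure"}

    opportunities = len(results)
    in_service = sum(1 for r in results if r["status"] in SIGNAL_STATUSES)
    product_failures = sum(1 for r in results if r["status"] == "product_failure")

    in_service_ratio = in_service / opportunities            # 0.75 here
    product_failure_rate = product_failures / opportunities  # 0.25 here

    print(f"in service ratio: {in_service_ratio:.2f}")
    print(f"product failure rate: {product_failure_rate:.2f}")
    ```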

    Cost

    On the opposite side of the equation, there is the cost of building and maintaining a test case and suite, as well as the indirect cost of things like maintaining data, test hooks/harnesses, and even product features that exist to support tests being run in the system, or that have restrictions placed on their implementation or change to support testing efforts. Expanding out from that, though, there are also the costs that reduce confidence. Since the value of tests is providing confidence in the correctness of the product, anything that reduces that confidence is a cost borne by the organization as a whole.

    initial development cost – How much will/did it cost to develop the definition, supporting data/processes, and automation if applicable. This should be available from ticketing systems and version control systems for automation.

    execution time – How long does it take to run this test? This is relevant to manual tests, as it translates directly to work hours. It is also relevant to automation, as it determines how often the test can be run, and how far back in the delivery chain (e.g., local unit/integration tests on every commit or file save, vs. run as part of CI, vs. run as part of a periodic sweep) the test can practically be included. This data should be available from a combination of test results, potentially CI logs, and ticketing systems.

    maintenance cost – How much time is put in to maintaining the test, directly or indirectly? This includes updating manual definitions, adjusting tests that break, maintaining data the tests are dependent on, maintaining any automation, and maintaining any enabling hooks and test features in the product. This is a bit harder to track, since many maintenance activities have bearing on multiple tests and the costs have to be apportioned across them, but with a little effort and diligence, it should be doable within a ticketing system, with linking and a custom field or two.

    out of service ratio – Slightly different from the in service ratio: rather than measuring actual executions against idealized possible executions, this metric measures actual executions against actual attempted executions. Basically, it measures how much time a test spends in maintenance. There are two variations here. The first is based solely on how often a test is not executed or fails due to a test issue, and tells us how consistently we can rely on the coverage the test provides. The second also includes being out of service because of a blocking product issue. The second form is of course directly tied to the turnaround time for bug fixes in the product, but it can provide a signal on which tests may need to be revised to narrow their scope (for example, a test that unnecessarily couples features, so a bug in either feature blocks both from being covered) or to change their approach to be more isolated around the targeted behavior. This data should be available from test result logs, provided they include failure categorization (e.g., product vs. test issue). For automated tests, it could also potentially make use of data from version control systems. A rough computation of this metric and time to fix is sketched after these cost metrics.

    uncaught issue rate/count – How often, or how many times, does a product bug get identified that is within the coverage area of a given test? This goes straight to confidence. There are also two variations here. The first concentrates only on issues caught downstream of the test in the development process, to illuminate holes in coverage. The second looks upstream of the test, to get a signal on whether a test may be a good candidate for improvement or porting to move it earlier in the development process. This should be collectable from a combination of ticketing system, version control, and CI/CD logs/events.

    time to fix, total and average – How long does it take to fix test issues for a given test? The total (and the total vs. the number of executions, or vs. the total amount of time the test has existed) tells us which tests need attention for either reworking or review/retirement. The average per test issue tells us, roughly speaking, how hard it is to fix a given test. Variants here are time from first failure to fix, and time from start of work on a fix to delivery. This can be collected from a combination of ticketing system, version control, and test result data.

    average time to isolation – How long does it take to root cause a failure of a test? This can be sliced up a few ways as well. First, how long it takes to identify whether a test failure is a product or test issue. Second, for product issues, how long it takes from first failure or from start of work until the reason for the failure is identified and work on a fix can start. Tests should be diagnostics that facilitate fixes, not just pass/fail gates, and this metric allows us to optimize that factor. This data can be made available from a combination of test results and ticketing system information.
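
    In the same spirit, here’s a small sketch for two of the cost metrics, the out of service ratio and time to fix. The record shapes, status names, and the exact way product-blocked attempts are counted are my own assumptions, not prescriptions.

    ```python
    from datetime import datetime

    # Hypothetical attempted-execution records; "blocked_test_issue" means the test
    # itself was broken or in maintenance when we tried to run it.
    attempts = [
        {"test": "checkout_happy_path", "outcome": "ran"},
        {"test": "checkout_happy_path", "outcome": "ran"},
        {"test": "checkout_happy_path", "outcome": "blocked_test_issue"},
        {"test": "checkout_happy_path", "outcome": "blocked_product_issue"},
    ]

    ran = sum(1 for a in attempts if a["outcome"] == "ran")

    # First variation: only test issues count against the test, so product-blocked
    # attempts are dropped from the denominator (one reasonable interpretation).
    test_issue_attempts = [a for a in attempts if a["outcome"] != "blocked_product_issue"]
    out_of_service_ratio_test_only = 1 - ran / len(test_issue_attempts)  # ~0.33 here

    # Second variation: attempts blocked by product issues count too.
    out_of_service_ratio_all = 1 - ran / len(attempts)  # 0.5 here

    # Time to fix, measured from first failure to delivered fix, per test-issue ticket.
    fix_windows = [
        (datetime(2025, 3, 10, 9, 0), datetime(2025, 3, 11, 14, 0)),
    ]
    total_fix_hours = sum((end - start).total_seconds() / 3600 for start, end in fix_windows)
    average_fix_hours = total_fix_hours / len(fix_windows)  # 29.0 hours here
    ```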

    DORA Explora

    Development and test have a bidirectional relationship; development produces features and fixes for test to consume, but test also produces results that development consumes as work items and release approval steps. As a result, these metrics feed into DORA stats for both dev and test, though in different ways.

    Deployment Frequency

    On the test side, initial development and maintenance cost are going to impact this most. Runtime matters as well, more for manual testing than for automated, but much less so than the other two.

    On the dev side, it’s reversed from test, in that test runtime is going to directly impact how quickly a build can move through environments. Isolation time is also going to have a fair bit of impact.

    Change Lead Time

    On the test side, the same metrics apply as for deployment frequency. Delivery speed for development and the product also factors in, but those things are out of scope for this post.

    On the dev side, it’s again pretty much the same as deployment frequency, though test failure rate and time to fix are going to play in as well.

    Change Failure Rate

    For test, in and out of service ratios are going to be the main aligning metrics. Uncaught issue rate and count also track test deficiencies.

    For dev, product failure rate is basically this entire metric.

    Service Restoration Time

    With testing, time to fix maps directly onto this.

    Finally, in development, this will be impacted by fix and isolation time.

    Conclusion

    If some or all of these metrics are adopted, it’s possible to start having better informed discussions about testing than the common metrics of test count and feature/code coverage allow. It becomes possible to talk about meaningful SLAs for test development and quality. Capacity planning and work prioritization start to move from vibes-based to empirical. Basically, the ROI of the testing effort can be much more effectively measured and optimized.

  • Life Challenges

    March 19th, 2025

    The last couple of days have been tough. I’m currently getting close to losing my apartment since I’ve been out of work. This sometimes leads me into a cycle of depression and panic, where I go from feeling hopeless to not being able to prioritize what thing to do and thus ending up functionally frozen. Do I work on my resume? Coding project? Do some reading? Exercise? As soon as I start any of them, I can’t stop thinking about how I should be doing one of the others, and go nowhere on any of them.

    In addition to that, since I might be seeing my daughter a lot less soon, I’m trying to maximize my time with her, and bring as much positivity and energy to my time with her as I can. My situation is not her fault, and she shouldn’t suffer for it.

    All that said, it’s definitely time to get back on the horse. I’m going to post a blog today that’s out of band from my AI project, but I’ll go back to that tonight or tomorrow.

  • Adjustments Made and Planned

    March 16th, 2025

    Thinking over things since yesterday, I decided I wanted to try adding some more structure and directionality to the network. Instead of just round-robining edge creation between the different I/O clusters (a simplification I hoped I could get away with, but now think I can’t), I’m going to create at least 2 or 3 intermediate layers of neurons, then make them link up progressively rather than just cross-linking them to each other (see the sketch below). This should cut down on the cycle creation that I think is killing the system.
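
    Here’s a rough sketch of what I mean by progressive, forward-only linking. The layer sizes and the use of plain strings for neurons are illustrative assumptions, not the actual generation code.

    ```python
    import random

    # Hypothetical forward-only wiring between layers; neurons are just string ids here.
    layers = [
        [f"in_{i}" for i in range(5)],    # input cluster neurons
        [f"mid1_{i}" for i in range(8)],  # intermediate layer 1
        [f"mid2_{i}" for i in range(8)],  # intermediate layer 2
        [f"out_{i}" for i in range(4)],   # output cluster neurons
    ]

    edges = []
    for upstream, downstream in zip(layers, layers[1:]):
        for source in upstream:
            # Only link forward to the next layer, never backward or sideways,
            # so no cross-layer cycles can form.
            target = random.choice(downstream)
            edges.append((source, target))

    print(edges[:5])
    ```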

    Also, I realized I hadn’t adjusted the short circuiting logic in the interrogate methods to work with the paradigmatic deconfliction I did earlier, so I fixed that (https://github.com/silasray/aiexplore/commit/9fac38b263bf8c9d990a76a0af37383cbb848d3f) in the time I had to work today. Tomorrow, hopefully I can make it through the structuring enhancements.

  • The Time Has Come For Research

    March 15th, 2025

    Well, I made the changes I identified yesterday, and got everything running, but I’m hitting issues with training. I’m not sure if it’s not enough cycles, the structure of the network, or something else, but this feels like the point at which you either learn from others or learn from long and painful experience. It’s also pretty slow to run, given that it’s single-threaded and in Python, so not exactly the best for iteration when each cycle takes minutes even with a trivially small network.

    I do need to think through how to make traversal more efficient though. It balloons out across the network way more than I want it to. Again, this could be network structure; maybe there are just too many cycles created from my generation logic. Maybe I’ll look into this tomorrow, unless I decide to dive into research heavily. Also need more tests, but I kinda wanna keep riding the flow and only go back to that when I get stuck or need a little mental break. Onward…

  • Well, That Won’t Work…

    March 14th, 2025

    I was poking around at my code today, and I realized that after resolving the internal inconsistency in the design model for my neural net, I can’t get away with the somewhat cheaty method for resolving solutions (https://github.com/silasray/aiexplore/blob/7c92466c5d8c128418d439cc3715e3ac8f5a1ceb/net.py#L125), because the math doesn’t math now that there’s an upper bound to the activation levels for neurons, no matter how many times a resolve is run. I was thinking I could just brute force my way to a solution by rerunning resolve() until some output finally fully activated, but that is impossible now.

    So, what needs to be done now?

    First, I need to switch over from waiting for full activation of an output (or at least from that being the only mechanism for resolution) to returning a bunch of scored outputs. Surprise surprise, there’s a reason that systems work this way.

    Second, since I’m now going to have more than one output per activation, the reinforce pathway (https://github.com/silasray/aiexplore/blob/7c92466c5d8c128418d439cc3715e3ac8f5a1ceb/net.py#L136) will need to be adjusted. I’m not sure yet if I want to make the reinforce fully centered on an output rather than on an activation, but I think I’m leaning in that direction, because it just feels like it makes sense that way. A rough sketch of what I’m picturing is below.
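
    Here’s a minimal sketch of the shape I’m picturing. The function names and data structures are my own placeholders, not the actual resolve()/reinforce() signatures in net.py.

    ```python
    # Hypothetical shapes for scored outputs and output-centered reinforcement.

    def resolve_scored(output_activations):
        """Return all outputs ranked by activation level, instead of waiting
        for a single output to fully activate."""
        return sorted(output_activations.items(), key=lambda item: item[1], reverse=True)


    def reinforce_output(weights, winning_output, rate=0.1):
        """Reinforce centered on the chosen output rather than on one activation."""
        for synapse in weights.get(winning_output, {}):
            weights[winning_output][synapse] += rate
        return weights


    scores = resolve_scored({"out_0": 0.2, "out_3": 0.9, "out_1": 0.4})
    best_output, best_score = scores[0]  # ("out_3", 0.9)

    weights = {"out_3": {"syn_a": 0.5, "syn_b": 0.2}}
    reinforce_output(weights, best_output)  # bumps both synapse weights by 0.1
    ```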

    I did add a test or 2 today, but nothing worth committing. Looks like the docket is set for tomorrow though.

  • Project Plan

    March 12th, 2025

    I’ve been working on a project to build a neural net from scratch (https://github.com/silasray/aiexplore) for the last week or so. My initial goal is to get the current approach to train a model with the inputs 0, 1, 2, +, and -, and the outputs 0, 1, 2, and 3, and have it return the sum or difference given 3 inputs (a couple of illustrative examples are sketched below). I’m trying to be conservative with the goal since this system is knowingly naive and inefficient.
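
    For concreteness, the kind of input/output pairs I have in mind look roughly like this. These particular examples are just illustrations of the goal, not actual training data from the repo.

    ```python
    # Each example is 3 inputs (operand, operator, operand) and the expected output.
    examples = [
        ((1, "+", 2), 3),
        ((2, "-", 1), 1),
        ((0, "+", 2), 2),
        ((2, "-", 2), 0),
    ]
    ```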

    After I get that working, I’m going to dive into how to do this with matrices, and from there dive into the GPU/CUDA world. I want to really understand this all, not just use it. I’m tracking my progress here on my blog with a dev diary of sorts as well.

    I know I could do this faster by just following some tutorial or something, but honestly, that’s not as fun to me. I want to actually find and solve the problems, not just leverage someone else’s thinking. That’ll come later.

  • Marginal Adjustments

    March 11th, 2025

    Didn’t get as much time today as I wanted on this, but at least wanted to be sure to put some time in. Today, I adjusted the algorithm for interrogating the network to resolve a solution to use marginal instead of total accumulation (https://github.com/silasray/aiexplore/commit/7c92466c5d8c128418d439cc3715e3ac8f5a1ceb). This should resolve the conflicting design assumption I realized I had yesterday. Tomorrow, more test writing.
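
    To illustrate the difference with made-up numbers (this is not the repo’s actual update logic): with total accumulation, the full upstream activation gets re-added on every visit, while with marginal accumulation only the change since the last visit gets added.

    ```python
    # Hypothetical illustration of total vs marginal accumulation during traversal.
    upstream_visits = [0.4, 0.6, 0.6]  # upstream activation seen on successive visits

    # Total accumulation: the full value is added every time, so repeat visits
    # (e.g. via a cycle) keep inflating the downstream activation.
    total = sum(upstream_visits)  # 1.6

    # Marginal accumulation: only the increase since the previous visit is added,
    # so revisits with no new activation contribute nothing.
    marginal = 0.0
    previous = 0.0
    for value in upstream_visits:
        marginal += max(0.0, value - previous)
        previous = value
    # marginal == 0.6
    ```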

  • Testing Round 1

    March 10th, 2025

    I was hoping I could get away with just ad hoc testing for a bit, but issues are proving a little too nuanced to efficiently root out without a more systematic approach. As a result, I started writing some unit tests (https://github.com/silasray/aiexplore/commit/95ae796dfeebb812f4d98e3887b2b832eff04ab3) with pytest to try to nail down expected behavior component by component before trying to analyze integrated behavior. In working on these tests, I also realized that something that I had convinced myself wasn’t an issue actually is.

    I have mismatched design assumptions in the interrogate logic. I tried to be a little efficient by terminating traversal when a neuron is fully activated (https://github.com/silasray/aiexplore/blob/main/net.py#L36), but the function of the downstream neurons is cumulatively additive instead of marginally additive (https://github.com/silasray/aiexplore/blob/main/net.py#L68). I need to switch to a marginally additive system anyway, because right now, I think it’s producing runaway activation loops, but that means adding activation tracking to the synapses, not just the neurons. I guess that’s the next thing to do then.
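
    As an example of the kind of component-level check I mean, here’s a minimal pytest sketch. The toy Neuron class is a stand-in I wrote for illustration, not the actual class in net.py.

    ```python
    # Hypothetical pytest sketch; this Neuron is a stand-in, not the one in net.py.
    import pytest


    class Neuron:
        def __init__(self, threshold=1.0):
            self.threshold = threshold
            self.activation = 0.0

        def accumulate(self, amount):
            # Cap activation at the threshold so repeat visits can't overshoot.
            self.activation = min(self.threshold, self.activation + amount)

        @property
        def fully_activated(self):
            return self.activation >= self.threshold


    def test_activation_is_capped_at_threshold():
        neuron = Neuron(threshold=1.0)
        neuron.accumulate(0.7)
        neuron.accumulate(0.7)
        assert neuron.activation == pytest.approx(1.0)
        assert neuron.fully_activated
    ```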

  • Let’s See What I Did Wrong!

    March 7th, 2025

    I put together a class for initializing and managing a network, with an initial stab at how to distribute Neurons and Synapses between the clusters and the cluster interlinks (https://github.com/silasray/aiexplore/commit/42598085ced3e564c462395fa20fe37754d8e8c3). I’m sure it’s a very naive approach, but it seemed decent enough to not go crazy building some fancy logic to solve a problem I’ve not yet properly identified. Now, run and debug time. Hopefully this doesn’t take as long to debug as it did to write, but that’s sometimes how it goes.
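
    For a sense of what that naive distribution might look like, here’s a rough sketch. The counts, the interlink ratio, and the representation of neurons as integers are all made-up assumptions, not the actual initialization code.

    ```python
    import random

    # Hypothetical naive split of neurons and synapses across clusters and interlinks.
    NEURON_COUNT = 60
    SYNAPSE_COUNT = 200
    CLUSTER_COUNT = 4
    INTERLINK_RATIO = 0.25  # fraction of synapses reserved for links between clusters

    neurons_per_cluster = NEURON_COUNT // CLUSTER_COUNT
    interlink_synapses = int(SYNAPSE_COUNT * INTERLINK_RATIO)
    internal_synapses = SYNAPSE_COUNT - interlink_synapses

    clusters = [list(range(i * neurons_per_cluster, (i + 1) * neurons_per_cluster))
                for i in range(CLUSTER_COUNT)]

    # Internal synapses connect two neurons within a randomly chosen cluster.
    internal = []
    for _ in range(internal_synapses):
        cluster = random.choice(clusters)
        internal.append((random.choice(cluster), random.choice(cluster)))

    # Interlink synapses connect neurons in two different clusters.
    interlinks = []
    for _ in range(interlink_synapses):
        a, b = random.sample(clusters, 2)
        interlinks.append((random.choice(a), random.choice(b)))
    ```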
