Impressive work, but I'm confused on a number of fronts:
- You are serving closed models like Claude with your CTGT policy applied, yet, the way you described your method, it involves modifying internal model activations. Am I misunderstanding something here?
- Could you bake the activation interventions into the model itself rather than it being a runtime mechanism?
- Could you share the publications of the research associated with this? You stated it comes from UCSD.
- What exactly are you serving in the API? Did you select a whitelist of features to suppress that you thought would be good? Which ones? Is it just the "hallucination" direction that you showcase in the benchmark? I see some vague personas, but no further control beyond that. It's quite black-boxy the way you present it right now.
I don't mean this as a criticism, this looks great, I just want to understand what it is a bit better.
>yet, the way you described your method, it involves modifying internal model activations
It's a subtlety, but part of it works on API-based models; from the post:
"we combine this with a graph verification pipeline (which works on closed weight models)"
The graph-based policy adjudication doesn't need access to the model weights.
>Could you bake the activation interventions into the model itself rather than it being a runtime mechanism?
You could, via RFT or similar training on the outputs. As it stands, it functions as a layer on top of the model without affecting the underlying weights, so the benefit is that it doesn't create another model artifact for each customization.
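For readers wondering what a runtime activation intervention can look like in practice, here is a minimal sketch (not CTGT's actual implementation): a placeholder steering vector is added to one transformer layer's hidden states through a forward hook, so the underlying weights are untouched and removing the hook restores the original model. The model choice (gpt2), the layer index, and the random vector are all illustrative assumptions.

```python
# Minimal sketch of a runtime activation intervention (illustrative only):
# a steering vector is added to one layer's hidden states via a forward hook,
# so the base weights are never modified and no new model artifact is created.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in open model for illustration
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

layer_idx = 6
hidden_size = model.config.hidden_size
steering_vector = torch.randn(hidden_size) * 0.1  # placeholder; a real vector would be learned

def steer(module, inputs, output):
    # GPT-2 blocks return a tuple; the first element is the hidden states.
    hidden = output[0]
    hidden = hidden + steering_vector.to(device=hidden.device, dtype=hidden.dtype)
    return (hidden,) + output[1:]

handle = model.transformer.h[layer_idx].register_forward_hook(steer)

ids = tok("The chameleon changes color because", return_tensors="pt")
out = model.generate(**ids, max_new_tokens=20)
print(tok.decode(out[0], skip_special_tokens=True))

handle.remove()  # removing the hook restores the unmodified model
```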
>What exactly are you serving in the API?
It's the base policy configuration that created the benchmark results, along with various personas to give users an idea of how uploading a custom policy would work.
For industry-specific deployments, we have additional base policies that we deploy for that vertical, so this is meant to simulate that aspect of the platform.
> graph based policy adjudication
What do you mean by this? Does the method involve playing with output token probabilities? Or modifying the prompt? Or blocking bad outputs?
> how uploading a custom policy would work
Do you have more info on this? Is this something you offer already or something you are planning? How would policies be defined, as a prompt? As a dataset of examples?
We create a policy hierarchy with a graph structure, based on certain elements of the generative content coming into our system, as well as what we know about the application where it's deployed.
The main benefit is we can traverse this graph deterministically when evaluating content and determine which policies need to be applied (if any) in a more rigorous manner than just, say, stuffing 900 FINRA rules into a prompt.
On custom policies, yes, this is core functionality of our deployed product. This typically looks like PDFs, doc files, or even Slack transcripts with relevant business info. The policy engine discretizes these into tone, forbidden words, key phrases, etc. that form the elements of the aforementioned graph.
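To make "traverse this graph deterministically" a bit more concrete, here is a toy sketch of the general idea under my own assumptions - the node names, predicates, and rules below are invented for illustration and are not CTGT's actual schema:

```python
# Toy sketch of deterministic policy-graph traversal (illustrative only).
from __future__ import annotations
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class PolicyNode:
    name: str
    applies: Callable[[dict], bool]           # deterministic predicate over the content
    rules: list[str] = field(default_factory=list)
    children: list["PolicyNode"] = field(default_factory=list)

def applicable_rules(node: PolicyNode, content: dict) -> list[str]:
    """Walk the hierarchy top-down; a branch is only explored if its predicate matches."""
    if not node.applies(content):
        return []
    rules = list(node.rules)
    for child in node.children:
        rules.extend(applicable_rules(child, content))
    return rules

# Hypothetical hierarchy: insurance vertical -> NAICS-code checks.
root = PolicyNode(
    name="insurance",
    applies=lambda c: c.get("vertical") == "insurance",
    children=[
        PolicyNode(
            name="naics",
            applies=lambda c: "naics_code" in c,
            rules=["naics_code must exist in the code set for the claim year"],
        ),
    ],
)

content = {"vertical": "insurance", "naics_code": "524126", "claim_year": 2017}
print(applicable_rules(root, content))
```

The point of the traversal is that only the rules whose predicates match the incoming content get applied, rather than every rule being stuffed into every prompt.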
Okay, but what does "applied" look like? Including a prompt?
Congrats on the launch - your value-add is quite confusing to someone who's at the applied AI layer. This comes off as more of a research project than a business. You're going to need an incredibly compelling sales pitch for me to send my data to an unknown vendor to fix a problem that might be obviated by the next model release (or just stronger evals with prompt engineering). Best of luck.
> they mimic common misconceptions found on the internet (e.g. "chameleons change color for camouflage")
Wait what, what do chameleons actually change color for then?? TIL.
---
So if I understand correctly, you take existing models, do fancy adjustments to them so that they behave better, and then sell access to that?
> These are both applications where Fortune 500 companies have utilized our technology to improve subpar performance from existing models, and we want to bring this capability to more people.
Can you share more examples on how your product (IIUC, a policy layer for models) is used?
The product integrates as a policy-as-code layer on top of their existing models, so they don't have to fine-tune, prompt engineer, etc. to get them up to par in their deployments, as is standard now.
One example I like discussing is insurance, where the local, state, and federal policy landscape changes frequently. We worked with an Inc. 5000 insurtech that had issues with hallucinated NAICS codes, which are used to profile the risk of an individual's profession. Their enterprise Claude model generated a NAICS code that was valid and passed AWS Bedrock's guardrails, but wasn't valid for the year the claim was made. We were able to catch that with the policy engine.
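As a concrete illustration of the kind of deterministic check that can catch this, here is my own sketch, not their engine - the code sets below are tiny stand-ins, not real NAICS data. The idea is to validate the generated code against the revision in force for the claim year, which a generic "is this a valid NAICS code" guardrail would miss.

```python
# Hypothetical post-generation check in the spirit of the NAICS example:
# the generated code is validated against the code set in force for the
# claim year. The code sets below are made-up stand-ins.
NAICS_BY_YEAR = {
    2012: {"524126", "541511"},
    2017: {"524126", "541511", "518210"},
}

def check_naics(code: str, claim_year: int) -> bool:
    # Map the claim year onto the most recent revision at or before it.
    revisions = sorted(y for y in NAICS_BY_YEAR if y <= claim_year)
    if not revisions:
        return False
    return code in NAICS_BY_YEAR[revisions[-1]]

print(check_naics("518210", 2015))  # False: code only exists in the 2017 stand-in set
print(check_naics("518210", 2020))  # True
```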
I believe they change color to express emotion.
They change color to communicate AND to regulate body temperature AND as camouflage.
It is not a ‘myth’ that one of the use cases for their color changing is camouflage; I’m not sure what they are on about.
Can you share more about the challenges you ran into on the benchmarking? According to the benchmark note, Claude 4.5 Opus and Gemini 3 Pro Preview exhibited elevated rejection and were dropped from TruthfulQA without further discussion. To me this begs the question: does this indicate that frontier closed SOTA models will likely not allow this approach in the future (i.e. in the process of screening for potential attack vectors), and/or that this approach will be limited to certain LLM architectures? If it’s an architecture limitation, it’s worth discussing chaining for easier policy enforcement.
I checked with the team and it may have been a temporary rate-limiting issue. We've rectified the results; it seems to have been an isolated case.
https://www.ctgt.ai/benchmarks
Thanks for the thoroughness! I look forward to the next steps as you all apply this approach in other unique ways to get even better results.
Are these benchmarks correct that adding Anthropic's Constitutional AI system prompt lowered results across all the models?
So if I understand, this is basically advanced activation steering as a service? And you have already identified vectors for several open models that make them more truthful or better at reasoning and apply them automatically?
Because the API has a persona option, which might be achieved with something like this: https://github.com/Mihaiii/llm_steer - or maybe for closed models you just have to append to the prompt.
What open-source models are available? In the docs I only see mention of Google Flash Lite or something, which is closed.
--I was able to jailbreak it--
https://playground.ctgt.ai/c/5028ac78-1fa4-4158-af73-c9089cb...
Never mind, that was the ungoverned version of Gemini; their models worked.
It was able to resist a different known jailbreak for Gemini though:
https://playground.ctgt.ai/c/a5aec2dc-c40d-4232-8bb1-69a1cec...
Glad you played around with it and that our tech worked.
Are you not concerned that model creation companies will bake this into their next model? I am trying to understand the business model.
Another question is how you would claim credit. People believe the quality of the end result depends only on the model, with serving being responsible only for speed.
We had this question come up frequently during our fundraise.
Our customers' risk profile is such that having the model provider also be the source of truth for model performance is objectionable. There's value to having an independent third party that ensures their AI is doing what they intend it to, especially if that software is on-prem.
On the credit point, that's not necessarily what we're after in these deployments. This is a happy alignment of relatively esoteric research that personally excited me and a real business problem around the non-deterministic nature of GenAI. Our customers typically come to us with a need to solve that for one reason or another.
> Are you not concerned that model creation companies will bake this into their next model?
Usually, the business strategy when that's a concern is to court an acquisition.
Assuming that you're doing actual innovation and that the effort behind making it commercially mature is non-trivial, your company and its established assets/staff/insights/deals become valuable as a way to leapfrog in.
Of course. That would make them a research company -- with a limited selection of potential buyers. It's not the worst gig.
Running into "no healthy upstream" when navigating to the link -- hug of death maybe?
Indeed, we had a huge influx; it should be back up now. Thanks for pointing it out.
Why not apply changes to the underlying model so that you crush every available eval?
SOTA results are a happy byproduct of the core mission of our approach, which is to enable the effective and simple translation of policy documents into a model without having to fine-tune or prompt engineer. This performance is somewhat unexpected but also makes sense, so we're still trying to figure out the best way to harness it. That may include releasing model artifacts in the future.
The link sends me to a Chat UI with no context about the product. An intro or walkthrough would be useful.
Check out the walkthrough linked in the post: https://video.ctgt.ai/video/ctgt-ai-compliance-playground-cf...
Ah - I didn't realize the title was linked to a URL (https://playground.ctgt.ai). We usually let Launch HNs be text posts so I've taken that link out of the title now.
Do you see the looming Butlerian Jihad as a challenge to your business model?
We'll be back when the Holy War begins.
> where the fallout
Heh.