
engineering, AI
March 25, 2026

Running Agentic Security Questionnaires with Prefect Cloud

Brendan Dalpe
Senior Sales Engineer

Vendor security questionnaires are one of those operational realities every software company knows well. They show up in different formats, they carry real business weight, and they rarely stay small for long.

Before we built this workflow, each assessment could take hours, sometimes days. Someone had to intake the questionnaire, break out the questions, search for prior answers, find the right policy or report, check product-specific details, pull in the right internal owner when something was unclear, and then package everything into a response that was accurate enough to send outside the company.

The bottleneck was not writing. It was rebuilding institutional memory every time a questionnaire arrived.

We knew pretty early that this was not just an LLM problem. We were not looking for a tool that could generate plausible security answers in isolation. We needed a system that could actually run the process: ingest questions, retrieve the right context, branch when the answer depended on product or deployment model, pause when a human needed to review something, and then capture the best outputs so the next assessment would be easier.

We solved it by putting Prefect Cloud managed execution at the center of the workflow.

Why we chose Prefect

The most important decision we made was how and where to run the workflow.

Security questionnaires are a classic bursty internal workload. Some weeks we have several active assessments. Other weeks we have none. We wanted the benefits of orchestration, visibility, retries, and structured workflow execution, but we did not want to run and maintain idle infrastructure just to be ready for the next request.

Managed execution was a natural fit.

With Prefect Managed work pools, we can run flows on Prefect’s infrastructure without operating our own worker layer for this use case. The Prefect Managed infrastructure guide describes that model: remote execution without standing up workers or a separate cloud account for the execution layer. That gave us a clean way to orchestrate an agentic workflow without taking on more platform overhead.
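As a sketch, a deployment that targets a Prefect Managed work pool can be declared in a `prefect.yaml` file; the pool name, deployment name, and entrypoint path below are illustrative, not our actual configuration:

```yaml
# prefect.yaml (illustrative names)
deployments:
  - name: security-questionnaire
    entrypoint: flows/assessment.py:run_assessment  # hypothetical module path
    work_pool:
      # A Prefect Managed work pool created in Prefect Cloud;
      # no workers or execution infrastructure to operate.
      name: managed-pool
```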

The trade-off fit our use case well. We are not especially sensitive to cold-start latency here. If a run takes an extra minute to start, that is still dramatically better than a process that used to consume hours or days of manual work. What mattered much more was getting a dependable execution layer with a scale-to-zero operating model, which lines up with the economics Prefect outlined when it introduced Prefect Serverless.

The point was to automate the work without creating another system we had to babysit.

Treating the problem as a workflow, not a prompt

Once we framed the problem correctly, the architecture became much clearer.

A vendor questionnaire is not a single generation step. It is a sequence of decisions. We need to ingest the questionnaire, normalize the questions, retrieve the most relevant prior answers, ground those answers in supporting documents, and then decide what should happen next for each question.

Some questions are easy and can be answered from verified prior responses. Some need supporting evidence from policy documents or reports. Some require product-specific handling. Some are confident enough to move forward automatically. Others should stop and wait for a human review before they make it into the final packet.

Orchestration is what makes the difference.

This workflow does not behave like a fixed pipeline. It behaves more like a stateful process that branches at runtime, loops when needed, and pauses when human judgment is required. That is the kind of workflow Prefect describes for AI teams: dynamic control flow and human-in-the-loop patterns instead of a static DAG. Prefect gave us a practical way to run that process in production.
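A stdlib-only sketch of that per-question control flow follows; the names and the confidence threshold are illustrative, and in the production flow the `needs_review` branch maps to Prefect's pause/resume primitives rather than a status field:

```python
from dataclasses import dataclass

@dataclass
class Question:
    text: str
    retrieval_score: float  # best match against prior verified answers (0..1)
    draft: str = ""
    status: str = "pending"  # pending -> auto_answered | needs_review

def route(q: Question, auto_threshold: float = 0.85) -> Question:
    """Decide, per question, whether the run proceeds or must pause."""
    if q.retrieval_score >= auto_threshold:
        # Confident match: reuse the verified prior answer automatically.
        q.draft = f"Reused verified answer for: {q.text}"
        q.status = "auto_answered"
    else:
        # Weak match: in the real flow, the run pauses here and waits
        # for a human reviewer before the answer enters the final packet.
        q.status = "needs_review"
    return q

questions = [
    Question("Do you encrypt data at rest?", retrieval_score=0.93),
    Question("Describe your SDLC for product X.", retrieval_score=0.41),
]
routed = [route(q) for q in questions]
print([q.status for q in routed])  # ['auto_answered', 'needs_review']
```

The key property is that the branch is decided at runtime, per question, which is why a static DAG was never a good fit.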

Once agentic workflows handle real business processes, governance and interoperability stop being abstract concerns. That is part of why Prefect joined the Agentic AI Foundation.

Building on the tools we already used

We also wanted this to fit into the way our team already worked.

We were already using Notion as the system of record for tracking assessments, so instead of introducing a new operational tool, we extended what we had. Each assessment lives as a Notion database entry. By adding templates to those entries, we created a consistent structure for managing questionnaire questions, tracking which ones are answered, and promoting useful responses into our knowledge base once they have been reviewed.

Notion became the operational layer for the process. It tracks the state of the work, the ownership of answers, and the curation of reusable knowledge.

Prefect became the execution layer. It runs the questionnaire workflow itself, coordinates the branching logic, and handles the places where work needs to pause, resume, or escalate.

Behind that, the rest of the stack is intentionally simple. We use AWS S3 vector buckets for retrieval, AWS Nova Lite and Nova Pro for inference, and a combination of prior answers and supporting documentation to ground the output.
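Conceptually, the retrieval step reduces to nearest-neighbor search over embedded prior answers. A stdlib sketch of that idea, with toy three-dimensional "embeddings" standing in for real embedding vectors (in production this is a query against an S3 vector bucket, with Nova models on Bedrock doing the generation):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Toy vectors for prior verified answers; real ones come from an
# embedding model and live in an S3 vector bucket.
knowledge_base = {
    "We encrypt data at rest with AES-256.": [0.9, 0.1, 0.0],
    "Access is reviewed quarterly.":         [0.1, 0.8, 0.2],
}

def top_match(query_vec):
    """Return the prior answer whose embedding is closest to the query."""
    return max(knowledge_base.items(), key=lambda kv: cosine(query_vec, kv[1]))

answer, _ = top_match([0.88, 0.15, 0.05])
print(answer)  # the closest prior answer grounds the draft response
```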

The separation of responsibilities ended up being clean: Notion tracks the work, Prefect runs it, and the knowledge layer improves it over time. Here is what that looks like end to end:

[Diagram] Assessment workflow architecture showing Notion feeding into the Assessment Pipeline and KB Pipeline, with S3 vector storage for retrieval and Bedrock LLM for inference.

We were up and running within days. When a retrieval step returned weak matches or a question stalled waiting on human review, we could see exactly where the run needed attention.

The memory layer mattered as much as the workflow

The workflow only became reliable once we invested in the memory layer behind it.

We extended our Notion workspace with a knowledge base of previously answered security questions. Those entries can be verified, assigned an owner, and scoped to specific products. That last part matters a lot in practice. A generic answer might sound correct, but still miss what the customer is actually asking if the question depends on a specific product, control boundary, or deployment model.

We also built a supporting document database that contains policies, reports, and other source material. Those documents feed the retrieval layer so the workflow is not relying only on historical questionnaire responses. It can also ground answers in evidence that can stand up to scrutiny from a security reviewer.

Both layers together made the output dependable. Prior answers give the system speed and consistency. Supporting documents provide evidence and confidence. We found that we needed both.

The feedback loop is where the compounding value shows up

The difference between a one-shot automation and a system that compounds is what happens after each run.

When an assessment is complete, the strongest answers can be reviewed, verified, assigned to an owner, and added back into the knowledge base for future reuse. The result is a feedback loop: every completed questionnaire leaves behind structured knowledge that improves the next run.
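The promotion step can be modeled as a filter over completed answers. The fields below are illustrative (in practice they live as Notion database properties), but the rule is the same: only reviewed, owned answers become reusable knowledge:

```python
from dataclasses import dataclass

@dataclass
class Answer:
    question: str
    text: str
    verified: bool = False   # has a human reviewed and approved this?
    owner: str = ""          # who is accountable for keeping it accurate?
    products: tuple = ()     # which products/deployment models it applies to

def promote_to_kb(completed: list) -> list:
    """Keep only answers that are verified and have an assigned owner."""
    return [a for a in completed if a.verified and a.owner]

completed = [
    Answer("Encryption at rest?", "AES-256 everywhere.", verified=True,
           owner="security-team", products=("cloud",)),
    Answer("Custom SSO flow?", "Draft, needs product input.", verified=False),
]
kb_additions = promote_to_kb(completed)
print(len(kb_additions))  # 1: the unreviewed draft stays out of the KB
```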

Over time, that changes the shape of the work. More of the questionnaire starts from strong defaults. Fewer questions require net-new effort. The human work shifts away from reconstructing the same answers over and over and toward reviewing edge cases, tightening nuanced responses, and improving the overall system.

With each cycle, the share of questions the system can answer confidently grows, and the time the team spends per assessment shrinks.

Why the economics worked

Our cost to complete an entire assessment has been about $0.01 per question, covering inference through AWS Nova, retrieval against S3, and the Prefect run itself.

That number surprised us. The process we replaced involved hours of skilled human time across multiple people. At less than $1 for a full assessment, the workflow pays for itself the moment it saves a single person one minute of work. Even if the per-run cost were ten times higher, the math would still hold.
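The arithmetic behind that claim is simple. Using integer cents to keep it exact, and with an illustrative assessment size, hourly rate, and manual-effort estimate (only the per-question cost comes from our actual runs):

```python
cost_cents_per_question = 1       # ~$0.01 observed: inference + retrieval + run
questions = 80                    # illustrative assessment size
run_cost_cents = cost_cents_per_question * questions

manual_cost_cents = 4 * 75 * 100  # illustrative: 4 hours at $75/hr

print(run_cost_cents, manual_cost_cents // run_cost_cents)  # 80 375
```

Under these assumptions the automated run costs $0.80 and the manual process costs roughly 375 times more, which is why even a 10x error in the per-run cost would not change the conclusion.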

What changed for the team

The first thing that changed was speed. Assessments that used to take hours or days can now move in minutes.

But the more meaningful change is how the work feels operationally.

We do far less copy-paste work. Answers are more consistent. Ownership is clearer. More of each response is grounded in actual source documents instead of inbox archaeology or memory.

Most importantly, the humans in the loop are now spending time where their judgment matters most. They are reviewing ambiguous cases, checking product-specific nuance, and improving the knowledge base. They are not rebuilding the same answer set from scratch every time a vendor questionnaire arrives.

Why this worked for us

The model was the easiest part to get right. What made this successful was everything around it: a knowledge base that compounds with each assessment, human review where judgment matters, and a feedback loop that turns completed work into reusable institutional memory.

If you are looking at a similar problem, start with the execution and review layer. Get the orchestration, the knowledge base, and the human checkpoints working before you invest in prompt engineering. If you want to build agentic workflows on serverless infrastructure, Prefect Cloud is free to start.