
Research Agents Before Agents Were Cool

We built autonomous research agents with Google Search and task planning in May 2023, before the agentic AI wave.

By Alexey Suvorov · 6 min read

May 2023. Eighteen months before Gartner would declare agentic AI a top strategic technology trend. Twenty months before enterprise inquiries about multi-agent systems surged 1,445%. We shipped a feature called “Bulk Chat” that let AI plan research tasks, search the web, scrape pages, and synthesize findings into structured reports.

We didn’t call them agents. We called them “bulk chat tasks.” But the architecture was the same pattern the industry would spend the next two years getting excited about.

What we built in May 2023

The feature was born from a simple observation. Our dashboard users weren’t just chatting with AI. They were conducting research – asking multi-part questions that required gathering information from multiple sources, cross-referencing facts, and producing structured summaries.

A single chat turn couldn’t do that. The user would ask a question, get an answer based on the model’s training data, then have to manually search the web for verification, paste relevant text back into the chat, and ask follow-up questions. The workflow was AI-assisted, but human-driven at every step.

Bulk Chat automated that entire loop.

Task planning. The user provides a research objective. GPT-3.5 decomposes it into discrete subtasks: “Search for X,” “Find statistics about Y,” “Compare A and B,” “Synthesize findings.”
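A minimal sketch of that decomposition step. The prompt wording and the numbered-list parsing convention are illustrative assumptions, not the original implementation:

```python
import re

# Hypothetical planning prompt: the model is asked for numbered subtasks.
PLAN_PROMPT = (
    "Decompose the research objective below into at most 5 discrete "
    "subtasks, one per line, numbered.\nObjective: {objective}"
)

def parse_plan(model_output: str) -> list[str]:
    """Extract numbered subtasks like '1. Search for X' from a model reply."""
    subtasks = []
    for line in model_output.splitlines():
        match = re.match(r"\s*\d+[.)]\s*(.+)", line)
        if match:
            subtasks.append(match.group(1).strip())
    return subtasks
```

Keeping the plan as a parsed list, rather than free text, is what lets the orchestrator execute and retry steps individually.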

Tool execution. Each subtask gets routed to the appropriate tool. Google Custom Search API for web queries. A full-text web scraper for content extraction from the pages that search returns. The AI summarization pipeline for distilling long documents into relevant excerpts.
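The routing itself can be as plain as a dispatch table. The stub tools and the "verb: argument" tagging convention below are assumptions for illustration; the real pipeline presumably carried richer metadata:

```python
def google_search(query: str) -> list[dict]:
    """Stub for the Custom Search API call."""
    return [{"title": "...", "url": "...", "snippet": "..."}]

def scrape(url: str) -> str:
    """Stub for the full-text scraper."""
    return "page text"

def summarize(text: str) -> str:
    """Stub for the AI summarization pipeline."""
    return "excerpt"

# Map the verb the planner emits to the tool that handles it.
TOOLS = {"search": google_search, "scrape": scrape, "summarize": summarize}

def route(subtask: str):
    """Return the tool for a subtask tagged like 'search: EV market size'."""
    verb = subtask.split(":", 1)[0].strip().lower()
    return TOOLS.get(verb)
```

An unrecognized verb returns None, which the orchestrator can treat as a planning error rather than silently guessing.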

Sequential reasoning. Results from each step feed into the next. If the first search doesn’t return useful results, the agent reformulates the query. If a scraped page doesn’t contain the expected information, the agent tries alternative sources. The plan adapts based on intermediate findings.

Structured output. The final result isn’t a chat message. It’s a research report with sections, citations, and source links. Users could verify every claim by following the links back to original sources.

This was, by any modern definition, an agentic system. An AI with access to tools, a planning mechanism, and the ability to execute multi-step workflows autonomously.

Why we didn’t call them agents

The language didn’t exist yet. In May 2023, the AI industry was focused on chat. ChatGPT had launched five months earlier. The conversation was about prompt engineering, not autonomous execution. Papers on ReAct (Reasoning + Acting) and Toolformer were circulating in research circles, but the enterprise world hadn’t absorbed the concepts.

We called our feature “Bulk Chat” because it processed multiple research tasks in batch. The name described the user interface, not the underlying architecture. You could submit five research questions, and the system would work through them sequentially, producing a report for each.

The term “agent” in the context of AI systems didn’t enter mainstream discourse until late 2024, when OpenAI, Anthropic, and Google all began shipping agentic features. By the time Gartner published its 2025 hype cycle with agentic AI at the peak, we’d been running the pattern in production for two years.

The tools we gave it

The tool set was minimal by today’s standards, but it was enough to be useful.

Google Custom Search API provided web search. We set up a Custom Search Engine configured for broad web results. Each search returned titles, snippets, and URLs. The cost was manageable – Google’s pricing for Custom Search is per-query, and our rate limiter ensured we didn’t blow through quotas.
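For reference, a Custom Search query is a single GET against the JSON API, parameterized by an API key and a Custom Search Engine ID. This is a generic sketch of that call, not the article's code; the key and cx values are placeholders you would supply:

```python
import json
import urllib.request
from urllib.parse import urlencode

ENDPOINT = "https://www.googleapis.com/customsearch/v1"

def build_search_url(query: str, api_key: str, cse_id: str, num: int = 5) -> str:
    """Build a Custom Search JSON API request URL."""
    params = {"key": api_key, "cx": cse_id, "q": query, "num": num}
    return ENDPOINT + "?" + urlencode(params)

def search(query: str, api_key: str, cse_id: str) -> list[dict]:
    """Run one search and return (title, url, snippet) per result. Network call."""
    with urllib.request.urlopen(build_search_url(query, api_key, cse_id)) as resp:
        data = json.load(resp)
    return [
        {"title": i["title"], "url": i["link"], "snippet": i.get("snippet", "")}
        for i in data.get("items", [])
    ]
```

Because billing is per query, a rate limiter in front of search() is the natural place to enforce quota.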

Web scraping extracted full-text content from search results. We built a scraping pipeline that handled common web page patterns: article bodies, blog posts, documentation pages. It wasn’t a sophisticated headless browser setup – it was HTTP requests with HTML parsing and content extraction heuristics. Good enough for research-quality content retrieval.
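The spirit of those heuristics can be shown with the standard library alone: collect text inside paragraph tags and drop fragments too short to be article body. The 40-character threshold is an arbitrary stand-in, and a production extractor would handle far more page shapes:

```python
from html.parser import HTMLParser

class ParagraphExtractor(HTMLParser):
    """Collect text inside <p> tags, a crude article-body heuristic."""

    def __init__(self):
        super().__init__()
        self.in_p = False
        self.chunks: list[str] = []
        self.paragraphs: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_p, self.chunks = True, []

    def handle_endtag(self, tag):
        if tag == "p" and self.in_p:
            text = "".join(self.chunks).strip()
            if len(text) > 40:  # skip nav fragments and captions
                self.paragraphs.append(text)
            self.in_p = False

    def handle_data(self, data):
        if self.in_p:
            self.chunks.append(data)

def extract_text(html: str) -> str:
    """Return the likely article body of an HTML page."""
    parser = ParagraphExtractor()
    parser.feed(html)
    return "\n\n".join(parser.paragraphs)
```

Navigation, footers, and short captions fall away for free, which is most of what "research-quality content retrieval" requires.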

AI summarization condensed scraped content into relevant excerpts. When a scraped page was 5,000 words and the relevant information was in two paragraphs, the summarizer extracted those paragraphs. This was crucial for managing the context window – GPT-3.5’s 4,096 token limit meant we couldn’t just dump raw page content into the prompt.
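The budget math behind that step looks roughly like this. Here plain term overlap stands in for the AI summarization pass, and the four-characters-per-token estimate is a common rough heuristic, both assumptions of this sketch:

```python
def select_excerpts(paragraphs: list[str], query: str,
                    token_budget: int = 1500) -> list[str]:
    """Keep the paragraphs most relevant to the query within a token budget.

    Relevance is approximated by query-term overlap; the real pipeline
    used an AI summarizer for this judgement.
    """
    terms = set(query.lower().split())
    scored = sorted(
        paragraphs,
        key=lambda p: len(terms & set(p.lower().split())),
        reverse=True,
    )
    kept, used = [], 0
    for p in scored:
        cost = len(p) // 4  # rough tokens-per-paragraph estimate
        if used + cost > token_budget:
            continue
        kept.append(p)
        used += cost
    return kept
```

Whatever does the scoring, the shape is the same: rank by relevance, then fill the window greedily until the budget runs out.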

These three tools – search, scrape, summarize – formed a loop that could answer most research questions. The agent would search for information, scrape the most promising results, summarize the findings, and decide whether it had enough data to produce a final answer or needed to search again.
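That loop, reduced to its skeleton, with the three tools injected as callables. The sufficiency check and the query reformulation below are crude stand-ins for decisions the model made in the real system:

```python
def research(question: str, search, scrape, summarize,
             max_rounds: int = 3) -> list[str]:
    """Plan-execute-observe loop over search, scrape, summarize.

    search(query) -> list of {"url": ...} results
    scrape(url) -> page text
    summarize(text, focus) -> relevant excerpt
    """
    findings: list[str] = []
    query = question
    for _ in range(max_rounds):
        results = search(query)
        for r in results[:2]:  # scrape only the top hits per round
            findings.append(summarize(scrape(r["url"]), question))
        if len(findings) >= 3:  # stand-in for "enough data to answer?"
            break
        query = question + " statistics"  # stand-in reformulation
    return findings
```

The max_rounds bound matters: without it, a question the tools cannot answer would loop forever.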

The model constraints that shaped the architecture

Building agents on GPT-3.5 in 2023 was an exercise in working around limitations.

The 4,096 token context window was the primary constraint. Our task planning prompt, the conversation history, the tool results, and the final output all had to fit within that window. We built aggressive context trimming that dropped the least relevant tool results when the window got tight.
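A sketch of that trimming, assuming each tool result already carries a relevance score from elsewhere and approximating tokens as characters divided by four:

```python
def trim_context(items: list[tuple[float, str]],
                 budget: int = 4096) -> list[tuple[float, str]]:
    """Drop the least relevant tool results until the prompt fits the window.

    Each item is (relevance_score, text). Token counts are estimated
    as len(text) // 4, a rough heuristic.
    """
    total = sum(len(text) // 4 for _, text in items)
    kept = sorted(items, key=lambda it: it[0])  # least relevant first
    while total > budget and kept:
        _, dropped = kept.pop(0)
        total -= len(dropped) // 4
    return sorted(kept, key=lambda it: -it[0])  # most relevant first
```

Trimming whole results, rather than truncating each one, keeps every surviving excerpt internally coherent.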

Hallucination was the second constraint. GPT-3.5 would sometimes fabricate search results or cite pages that didn’t exist. Our architecture addressed this by keeping the AI out of the search and scrape steps entirely – those were deterministic tool calls. The AI planned which searches to run and interpreted the results, but the actual data retrieval was code, not generation.

Reliability was the third constraint. Multi-step plans would sometimes go off the rails. The agent might reformulate a query in a way that returned irrelevant results, then try to synthesize those irrelevant results into a coherent answer. We added guardrails: maximum step counts, relevance scoring on search results, and fallback behavior when a research path wasn’t productive.
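Those guardrails combine naturally into a gated retry. The term-overlap relevance score and the 0.3 threshold here are illustrative assumptions; the point is the shape, score results, retry a bounded number of times, and fall back to "no findings" rather than synthesizing junk:

```python
def relevance(query: str, snippet: str) -> float:
    """Fraction of query terms that appear in a result snippet."""
    terms = set(query.lower().split())
    if not terms:
        return 0.0
    return len(terms & set(snippet.lower().split())) / len(terms)

def guarded_search(query: str, search, min_relevance: float = 0.3,
                   max_attempts: int = 3, reformulate=None) -> list[dict]:
    """Retry with reformulated queries; give up instead of forcing an answer."""
    for attempt in range(max_attempts):
        results = [
            r for r in search(query)
            if relevance(query, r["snippet"]) >= min_relevance
        ]
        if results:
            return results
        if reformulate:
            query = reformulate(query, attempt)
    return []  # fallback: report that the research path was unproductive
```

Returning an empty list on failure is the critical design choice: it forces the final report to say "no reliable sources found" instead of inventing ones.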

These constraints forced us to build robust orchestration around unreliable intelligence. That principle – trust the AI for reasoning, trust code for execution – became foundational to everything we built afterward.

The path from Bulk Chat to agents

The evolution happened in three distinct phases.

Phase 1: Bulk Chat (May 2023). The original implementation. GPT-3.5 with Google Search, web scraping, and summarization. Batch processing of research tasks. Basic task planning.

Phase 2: OpenAI Assistants API (December 2023). When OpenAI released the Assistants API, we migrated Bulk Chat to use it. Assistants provided persistent threads (no more manual context management), built-in tool calling (cleaner than our custom implementation), and retrieval augmentation (upload documents, ask questions against them). The migration preserved our research workflow while upgrading the infrastructure beneath it.

Phase 3: Dashboard v2 agents (November 2025). The complete rebuild. 40+ specialized agents, each with distinct capabilities. E2B sandboxed code execution. Background task delegation. The Bulk Chat pattern – plan, execute tools, synthesize – became the core loop for every agent in the system.

Each phase kept the same fundamental architecture while upgrading the model capabilities, the tool set, and the execution environment around it.

What being early taught us

Being 18 months ahead of the agentic wave gave us something that can’t be compressed: production experience.

By the time “agentic AI” became a conference buzzword in late 2024, we’d already learned which research tasks agents handle well (fact-gathering, comparison, summarization) and which they handle poorly (nuanced analysis, tasks requiring domain expertise the model lacks, anything requiring real-time data fresher than the search index).

We’d learned that task decomposition quality determines agent quality. A well-planned research strategy with mediocre tools outperforms a poorly planned strategy with excellent tools. The planning step is where most agent failures originate, and it’s where we invest the most engineering effort.

We’d learned that citation and source tracking aren’t optional. Users don’t trust agent-generated research unless they can verify it. Every claim needs a link. Every statistic needs a source. Building this into the architecture from day one saved us from the trust problems that later entrants struggled with.

We’d learned that the model matters less than the orchestration. When we upgraded from GPT-3.5 to GPT-4 to Claude, the research quality improved. But the improvement came from better task planning and more reliable tool calling, not from the model knowing more facts. The orchestration layer – the code that manages the plan-execute-observe loop – is where the real product value lives.

The direct descendant of Bulk Chat is Pro Search, which we shipped in December 2024. The pattern is identical: user provides a research question, the system plans searches, executes them, scrapes results, and synthesizes findings. The implementation is entirely different – better models, better tools, better citation handling, better output formatting.

Pro Search exists because Bulk Chat proved the concept. And Bulk Chat existed because we were willing to ship something useful before we had the vocabulary to describe what it was.

The lesson is deceptively simple: if you see a user need, build for it. Don’t wait for the industry to agree that the pattern has a name. By the time “agentic AI” became a category, we had two years of production data, failure modes we’d already addressed, and an architectural foundation we’d already validated.

We weren’t visionaries. We were practitioners who noticed that research was a multi-step process and built software that treated it as one. The rest was timing.

Alexey Suvorov

CTO, AIWAYZ

10+ years in software engineering. CTO at Bewize and Fulldive. Master's in IT Security from ITMO University. Builds AI systems that run 100+ microservices with small teams.

