Live Updated: 23 June 2026

B2B lead engine (RO company data)

A complete B2B sales-intelligence engine for the Romanian market, built entirely solo, in Python with LLM orchestration (Claude Code + DeepSeek). It turns the public company registry into a daily feed of pre-qualified leads.

It pulls from 16+ data sources (ANAF via a 3-endpoint API, ONRC, mfinante, VIES, risco.ro, TERMENE, listafirme, Google Maps & Business Profile, BuiltWith, Hunter/Apollo, OpenCorporates, Clay, Similarweb, Semrush, DataForSeo), enriches them, scores them 0-100 on estimated revenue, fit and reachability, then qualifies them automatically.

The result: 8,307 firms at 99% data coverage, 10,000 firms/day throughput on a single proxy account, and 1,000 pre-qualified leads/day into the CRM. A manual process (Excel + Google/AI-agent checks) became an automated pipeline.


The problem

When I took over prospecting, everything lived in a single shared Excel spreadsheet. Each firm was a row, and qualification happened by hand, one line at a time. Enrichment was entirely manual: I would open each firm on Google and on a handful of AI chat assistants, pick out whatever I could find (a phone number here, a website there, an impression of whether the business was still active) and paste it all back into the spreadsheet. There were no scripts, no registry APIs, no ANAF/ONRC access, no automation of any kind. Getting a firm from possible target to actually qualified took roughly one week of hand work.

To understand why it was so slow, it helps to break down what "by hand" actually meant. For a single firm, the flow looked like this: search the name on Google, open the top results to confirm the firm existed and was still trading, ask one or two AI chat assistants to gather context, manually copy the contact details, then switch back to Excel and fill in the cells. Each of those steps was a point where a transcription error or a stale fact could slip in. There was no single source of truth, only whatever I could assemble inside a browser window and paste back.

The process was honest and the data I gathered was real, but it simply did not scale. One person could push through a few dozen firms before the roughly one-week lag turned any list into stale data. And staleness was not a theoretical worry: a firm that was struck off between the moment I added it to the list and the moment I reached it would sit there looking like a valid target when in fact it no longer existed. Contacts went missing because I had no systematic way to find them: if a number did not surface in the first Google results, the row stayed blank. And decisions about which firm was worth approaching rested on a gut impression rather than verified signals.

The underlying problem was that volume and quality pulled against each other. If I tried to process more firms per day, the time spent on each one dropped and verification quality dropped with it. If I checked each firm thoroughly, the number of firms per day collapsed. In the manual model there was no way to raise both at once. The shared Excel amplified this: being a single hand-edited file, it had no per-firm state history, could not automatically flag a firm as struck off, and could not re-check anything. It was a static snapshot of a moment that started aging the instant you saved it.

That gap, the week between a firm appearing as a possible target and a firm being qualified, is exactly why I built the pipeline. I did not start from a wish to use technology for its own sake; I started from a very concrete constraint: one person, one Excel file, a week of lag, and a list that went stale faster than I could clean it. Every engineering decision that followed was a direct reaction to one of these limits. The absence of a source of truth called for querying official registries directly instead of picking through Google results. The repeated manual work on every firm called for automation that runs without supervision. The one-week lag called for a system capable of chewing through thousands of firms a day rather than a few dozen.

So I set myself a goal that was simple to state and hard to hit: turn a one-week manual process into a system that chews through thousands of firms a day, completely unattended. "Unattended" was the important part: I did not want a tool that merely sped up the hand work, but one that replaced it, running on its own, checking firm status on its own, and delivering signals I could trust without manually reconfirming them. The rest of the story is how I got from that shared spreadsheet to the automated pipeline that replaced every step described above.

The data sources

The pipeline consumes 16+ sources, but they are not equal in weight. The backbone is ANAF, and I want to be precise about why: I consume it mainly through its live API, not just through the bulk download. The distinction matters in engineering terms: a bulk export is a photograph of yesterday, while an API call gives you the state right now, which is exactly what you need when deciding whether a firm is worth contacting today. There are three live endpoints, each chosen for a precise role:

  • Identity / VAT register, the v9 PlatitorTvaRest web service. I send a JSON array of CUIs (up to 500 per call), free, no auth. The batch of 500 is not cosmetic: it cuts the number of calls by a factor of 500 (n/500 requests instead of n), so the entire company universe is queried in a few hundred requests instead of hundreds of thousands. For each firm I get the name, the trade-register number, address, phone (about 71% populated), registration date, VAT-payer status, active/inactive status, the inactivation/reactivation/struck-off dates, legal form and CAEN code, plus recent signals such as VAT-on-collection, Split-VAT and e-Factura. This is the step-1 fiscal qualification gate and, at the same time, a preflight filter for struck-off/inactive firms: if I drop them here, I never spend expensive downstream calls on them. The filter fails open: if the API errors, the firm is treated as active, because I prefer a false positive (a dead firm that slips through) over losing a good firm to a network timeout.
  • Financial statements (bilant), a GET endpoint, one firm-year per request, no batching. Here I have no choice: ANAF exposes balance sheets only individually. The 1 request/second limit is respected strictly, and on a connection reset I retry once with a 2-second cooldown, long enough for the connection to settle without wasting time. It requires Chrome120 TLS-fingerprint impersonation, or ANAF refuses the connection at the handshake, before any HTTP; miss that detail and it looks like the server is down when it is actually rejecting you as a bot. I map 20 indicators (I1..I20) to fields: turnover, net profit, net loss, employee count, liabilities and so on, precisely the raw material for a firm's size-and-health score.
  • Batch trade-register (J) lookup, the same v9 service, but driven in batch mode (500 CUIs per request, 200-500ms response) specifically to recover the trade-register number alongside fiscal status. It is a second job of the same endpoint, kept logically separate because the J number is what links a firm to ONRC.
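The batching and the fail-open rule described in the list above can be sketched roughly as follows. This is a minimal illustration, not the pipeline's code: `lookup` is a hypothetical callable standing in for one batched register call, and the `{cui: is_active}` return shape is an assumption.

```python
from typing import Callable, Iterator, Sequence

BATCH_SIZE = 500  # per-call CUI limit stated in the text


def chunked(cuis: Sequence[str], size: int = BATCH_SIZE) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `size` CUIs."""
    for start in range(0, len(cuis), size):
        yield list(cuis[start:start + size])


def active_cuis(cuis: Sequence[str],
                lookup: Callable[[list[str]], dict[str, bool]]) -> list[str]:
    """Keep firms reported as active; fail open when a batch errors.

    `lookup` is hypothetical: it sends one batch and returns {cui: is_active}.
    """
    kept: list[str] = []
    for batch in chunked(cuis):
        try:
            status = lookup(batch)
        except Exception:
            # Fail open: an API error must not drop a good firm.
            kept.extend(batch)
            continue
        # A CUI missing from the response is also treated as active.
        kept.extend(c for c in batch if status.get(c, True))
    return kept
```

With 1,200 CUIs this issues three calls (500, 500, 200), and a batch that times out is passed through whole rather than discarded.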

Beyond the three live APIs there is an optional ANAF bulk download (CKAN data.gov.ro): four files that together cover the full company universe. The bulk is useful for seeding the starting list without thousands of calls, but the current-state deltas (e-Factura, Split-VAT, inactivation date) are NOT in the bulk CSVs. That confirms why ANAF is consumed mainly via the live API, with the bulk as a supplement only: the bulk tells you who exists, the live API tells you how things stand right now.

The rest of the sources cover specific gaps. ONRC (the live Recom API via myportal plus monthly bulk CSVs from data.gov.ro, the firm registry and the 333 MB legal-representatives file) links the firm to people and to the official record. VIES validates intra-community VAT over SOAP, useful for exporting firms. BPI and court records come via the free ECRIS public SOAP service (portal.just.ro), searched by party name, and supply litigation and insolvency signals. risco.ro supplies risk and financial data (paid); TERMENE covers procurement and court (paid, at most 75 firms per query, which is why I use a bracketing algorithm over turnover ranges to split the universe into slices that fit under the cap). listafirme.ro serves website discovery; firme.info/firmeapi.ro give ANAF debts and insolvency; and for web-stack and SEO I use BuiltWith, SimilarWeb, Semrush, DataForSeo and Google Custom Search. Google Maps / Google Business Profile are active live sources, used for local presence and business-profile data; OpenCorporates remains an active source for international corporate data. The whole collection is run under a single BrightData proxy account, tuned to roughly 10,000 firms per day.
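One way to picture the bracketing over turnover ranges is a recursive bisection: keep halving a range until each slice returns no more than the cap. The sketch below assumes a hypothetical `count(lo, hi)` callable that reports how many firms fall in a half-open turnover range; the real algorithm may choose its split points differently.

```python
from typing import Callable

CAP = 75  # maximum firms per query, as stated in the text


def bracket(lo: int, hi: int, count: Callable[[int, int], int],
            cap: int = CAP) -> list[tuple[int, int]]:
    """Split the turnover range [lo, hi) into slices holding <= cap firms.

    `count(lo, hi)` is hypothetical: it returns the number of firms whose
    turnover falls in the half-open range.
    """
    if count(lo, hi) <= cap or hi - lo <= 1:
        return [(lo, hi)]  # fits under the cap, or cannot be narrowed further
    mid = (lo + hi) // 2
    return bracket(lo, mid, count, cap) + bracket(mid, hi, count, cap)
```

The slices come back contiguous and in ascending order, so each one can be issued as its own query.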

Ingestion

Ingestion is the layer where raw, heterogeneous data turns into a disciplined stream. The governing principle is simple: each source has its own client, but they all follow the same discipline. That separation is not accidental: every public portal behaves differently, expects different headers, and fails in its own way, so forcing every source through a single generic client would have meant fragility. A per-source client instead isolates each portal's quirks, while the shared discipline (timeouts, retries, output shape) stays uniform across all of them.

The default transport is curl_cffi with Chrome120 TLS-fingerprint impersonation, falling back to plain requests or the standard-library urllib. The reason the TLS fingerprint matters is that many servers inspect the TLS handshake (the cipher ordering, the extensions) to tell a real browser apart from an automated client; an ordinary Python client has a characteristic, easily-rejected fingerprint. By imitating the Chrome120 fingerprint exactly, the connection looks like a legitimate browser. For the ANAF bilant endpoint, this Chrome120 TLS fingerprint is not optional: without it, ANAF refuses the connection. The fallback to requests/urllib exists as a safety net: if the impersonation library is missing or fails in some environment, the pipeline does not stop; it continues on a simpler transport.
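A minimal sketch of that transport choice, assuming only that curl_cffi may or may not be installed. The function names are hypothetical, and the fallback here goes straight to urllib to keep the example dependency-free.

```python
import urllib.request

try:
    # curl_cffi exposes a requests-like API with TLS impersonation.
    from curl_cffi import requests as cffi_requests
except ImportError:  # library missing in this environment
    cffi_requests = None


def transport_name() -> str:
    """Report which transport fetch() will try first."""
    return "curl_cffi" if cffi_requests is not None else "urllib"


def fetch(url: str, timeout: float = 30.0) -> bytes:
    """Fetch a URL, preferring a Chrome120 TLS fingerprint."""
    if cffi_requests is not None:
        try:
            resp = cffi_requests.get(url, impersonate="chrome120", timeout=timeout)
            return resp.content
        except Exception:
            pass  # impersonated request failed: fall through to urllib
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()
```

For an endpoint that rejects non-browser handshakes, the urllib fallback keeps the process alive but is not expected to succeed; it only matters for sources that do not check the fingerprint.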

Once fetched, the data collects into a per-firm aggregator. This is the design choice that keeps everything coherent: no matter how many sources the information arrives from, it converges to a single place, indexed by firm. The v8 enrichment runner drives an idempotent state machine across the steps FETCHED → CUI_VERIFIED → ANAF_DONE → RISCO_DONE → QUALIFIED → PUSHED. Idempotency is the central operational guarantee here: if a step has already been reached, running it again is a no-op. Concretely, that means if the runner restarts after a network drop halfway through a firm, the restart does not redo the already-finished steps and does not duplicate data. A machine error cannot corrupt a partial firm; the runner resumes from its checkpoint. That property turns a fragile process, which once had to be watched by hand, into one that can pick itself back up from wherever it was interrupted.
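The no-op-if-already-reached behaviour can be illustrated in a few lines. This is a sketch under assumptions: the step names come from the text, but the `state` dict and the `advance` helper are hypothetical.

```python
from enum import IntEnum


class Step(IntEnum):
    FETCHED = 1
    CUI_VERIFIED = 2
    ANAF_DONE = 3
    RISCO_DONE = 4
    QUALIFIED = 5
    PUSHED = 6


def advance(state: dict, target: Step, work) -> bool:
    """Run `work` only if the firm has not yet reached `target`.

    `state` is a hypothetical per-firm checkpoint dict. Returns True when
    work was executed, False when the step was a no-op.
    """
    if state.get("step", 0) >= target:
        return False  # already reached: idempotent no-op
    work()
    state["step"] = int(target)  # checkpoint only after work succeeds
    return True
```

Because the checkpoint is written only after `work` returns, a failure mid-step leaves the firm at its previous state and the step is simply retried on the next run.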

The actual merge reads every per-source JSON for a normalized CUI and writes one canonical record. CUI normalization is the quiet but critical step: the same fiscal code can appear with an RO prefix, with spaces, or with different leading zeros across sources, and without a canonical form the same firm would fragment into several records. The write is atomic: it goes to a temp file that then replaces the target, so a reader never sees a half-written record, and an interruption mid-write always leaves either the old record intact or the new one complete, never something corrupt in between.
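Both ideas fit in a short sketch. The exact canonical form is an assumption here (bare digits, no RO prefix, no leading zeros), and `write_atomic` is a hypothetical helper built on the standard `os.replace`.

```python
import json
import os
import re
import tempfile


def normalize_cui(raw: str) -> str:
    """Reduce a fiscal code to a bare digit string (assumed canonical form)."""
    text = re.sub(r"\s+", "", raw).upper()
    if text.startswith("RO"):
        text = text[2:]
    if not re.fullmatch(r"[0-9]+", text):
        raise ValueError(f"not a CUI: {raw!r}")
    return text.lstrip("0") or "0"


def write_atomic(path: str, record: dict) -> None:
    """Write JSON to a temp file in the same directory, then swap it in."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(record, handle, ensure_ascii=False)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp, path)  # atomic rename on the same filesystem
    except BaseException:
        if os.path.exists(tmp):
            os.remove(tmp)
        raise
```

The temp file is created next to the target on purpose: a rename is only atomic when source and destination live on the same filesystem.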

Nine sources are wired into the main aggregator: ANAF identity, ANAF financials, admins, firme.info, listafirme, Google Knowledge Panel, VIES, traffic and the lead score. Each contributes a different facet of the firm's profile (fiscal identity and status, financials, the people in charge, online presence, and the lead's quality signals), and their combination yields a far richer picture than any single source could give. Alongside these, a second signals aggregator folds six cross-border sources together and applies fresh extraction regexes for international couriers (DHL, FedEx, UPS, DPD, GLS, TNT, Aramex), country selectors, and non-RON currencies / non-RO languages, to score cross-border intent. The logic is intuitive: a firm that names international couriers, prices in currencies other than RON, or serves content in non-RO languages signals activity that reaches beyond the country's borders, which is exactly the kind of intent a cross-border B2B operator wants to detect. "Fresh regexes" means these patterns are applied on each run against current data, not inherited from a stale cache.
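As an illustration of the courier and currency extraction, a sketch along these lines would work. The courier list is the one from the text; the currency markers and the output shape are assumptions, and country selectors and language detection are left out.

```python
import re

# Case-sensitive on purpose: a case-insensitive "UPS" would match the
# English word "ups".
COURIERS = re.compile(r"\b(DHL|FedEx|FEDEX|UPS|DPD|GLS|TNT|Aramex|ARAMEX)\b")
# Assumption: a small illustrative set of non-RON currency markers.
CURRENCIES = re.compile(r"(€|\$|£|\b(?:EUR|USD|GBP)\b)")


def cross_border_signals(text: str) -> dict:
    """Extract courier and foreign-currency mentions from page text."""
    couriers = sorted({m.group(1).upper() for m in COURIERS.finditer(text)})
    currencies = sorted({m.group(1) for m in CURRENCIES.finditer(text)})
    return {
        "couriers": couriers,
        "currencies": currencies,
        "has_intent": bool(couriers or currencies),
    }
```

Running the patterns on the freshly fetched text each time, rather than on a cached extraction, is what keeps the signal current.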

At the financial level the merge prefers the turnover from the live ANAF API and only falls back to firme.info parsed revenue when ANAF has none. This trust hierarchy is deliberate: the live ANAF API is the authoritative, official source, so it always takes priority; firme.info acts strictly as a backstop, used only when the authoritative source is silent. This avoids the situation where a less reliable figure overwrites an official one, and keeps the financial data as close to the official truth as possible.

Anti-bot and scaling to about 10,000/day on one proxy

All of the scaling ran on a SINGLE BrightData residential proxy account. Choosing not to buy more accounts was deliberate: one account means one reputation-risk surface, predictable costs, and forced discipline. If you cannot buy extra bandwidth, you are obliged to use it intelligently. Early on I ran very conservative per-firm throttling, because the first goal was not speed but avoiding burning the account before I understood how each source reacted. As I learned the thresholds, I tuned the system to sustain about 10,000 firms/day on that one account (May 2026).

I want to be honest about the figures, because this is exactly where most case studies inflate things: the 10,000/day is the configured CEILING (DAILY_CAP), not an observed average. The realistic per-path throughput is about 7,200-8,700/day, and a meaningful part of that volume is carried by free home-IP scraping rather than the paid proxy. In other words, the cap protects the account from an abnormal day, while the residential home IP absorbs the "cheap" requests so the proxy budget is saved for the genuinely hard sources.

The full anti-bot arsenal, and why it looks this way:

  • Chrome120 TLS-fingerprint impersonation via curl_cffi (JA3 + HTTP/2), not just User-Agent strings. Many people think swapping the User-Agent is enough, but modern firewalls read the TLS handshake itself: cipher order, extensions, HTTP/2 behaviour. If the header says "Chrome" but the JA3 fingerprint screams "Python library", you are already flagged. The risky path rotates a fingerprint pool [chrome120, chrome119, safari17_0, edge99] so no single signature repeats forever.
  • A 5-tier fetch ladder, cheap to expensive: curl_cffi → subprocess curl HTTP/1.1 → HTTPS→HTTP downgrade → Playwright stealth → BrightData Scraping Browser over CDP. The logic is simple: try the zero-cost, fastest method first; escalate to a real browser only when genuinely necessary, because a headless browser consumes tens of times more resources than a plain fetch.
  • Cloudflare-stub detection: the "Just a moment" interstitials that arrive with HTTP 200 are the classic trap, because the status says success while the body carries only the challenge page and none of the real content. These are caught and force escalation to the stealth browser on Playwright; this is exactly the "Cloudflare bypassed via stealth browser" moment.
  • Client-side rate-limiting via file-mtime token-bucket locking: a global per-source lock across all parallel workers, with no central server. Using a file's timestamp as a token bucket means dozens of parallel processes coordinate their pace without a Redis queue or a dedicated service, which leaves fewer moving parts to break.
  • A single proxy account with per-request IP rotation (sticky was added, then dropped, because keeping the same IP increased the risk of being correlated). Failsafes probe the exit IP at the start and end of each batch, and a daily audit counts unique IPs, so if rotation gets "stuck" on one IP, it shows up immediately in the report.
  • An 8-entry User-Agent rotation pool as an extra layer on top of the TLS fingerprint.
  • Windowless CREATE_NO_WINDOW spawns, so no console windows pop up on screen and, more importantly, no orphan scrapers linger consuming resources after a crash.
  • Humanization: log-normal jitter (pauses with a real distribution, not the fixed intervals that betray a bot), decoy navigation every few firms, a 60s cookie warm-up at window start, and adaptive backoff on the failure rate (if failures rise, the system slows itself down).
  • Session-state JSON caching with canary-validated auto re-login (a test request that confirms the session is actually valid, not merely present); a refresh daemon probes every 30 minutes and atomically refreshes the storage state, so you never catch a half-written session.
  • A 1h backoff on 3 consecutive failures, a graceful STOP-file (you stop cleanly, not with a brutal kill), a PID single-instance lock, free-disk pre-batch checks, snapshot rotation every 12 batches, and auth refresh every 6 batches.
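The cheap-to-expensive ladder and the stub check from the list above combine naturally. The sketch below is a simplification under assumptions: each tier is a hypothetical callable returning `(status, body)`, the stub markers are illustrative strings, and a detected stub escalates to the next tier rather than jumping directly to the stealth browser.

```python
from typing import Callable, Optional, Sequence

# Assumed markers; real challenge pages may need a different set.
STUB_MARKERS = ("just a moment", "cf-chl", "challenge-platform")

Tier = Callable[[str], tuple[int, str]]  # url -> (status, body)


def is_cloudflare_stub(status: int, body: str) -> bool:
    """True when a 200 response is really a challenge interstitial."""
    lowered = body.lower()
    return status == 200 and any(marker in lowered for marker in STUB_MARKERS)


def fetch_with_ladder(url: str,
                      tiers: Sequence[tuple[str, Tier]]) -> Optional[tuple[str, str]]:
    """Try tiers cheap-to-expensive; return (tier_name, body) or None."""
    for name, tier in tiers:
        try:
            status, body = tier(url)
        except Exception:
            continue  # this tier failed outright: escalate
        if status == 200 and not is_cloudflare_stub(status, body):
            return name, body
    return None
```

Returning the tier name alongside the body makes it easy to log how often each source needs the expensive rungs.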

None of these layers appeared all at once. The 01/05/2026 ban, after just 273 requests in 5.5 hours, is what triggered the full rate-limit redesign. That incident was the lesson: it was not the number of requests that mattered, but how regular their pattern was. The humanization, the log-normal jitter, and the adaptive backoff all came out of it.
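The jitter and adaptive slowdown that came out of that incident can be sketched with the standard library alone. The median, sigma and slowdown factor below are illustrative values, not the pipeline's real tuning.

```python
import math
import random
from typing import Optional


def jitter_delay(median_s: float = 4.0, sigma: float = 0.6,
                 failure_rate: float = 0.0,
                 rng: Optional[random.Random] = None) -> float:
    """Draw a log-normal pause and stretch it as the failure rate rises.

    All numeric defaults are assumptions chosen for illustration.
    """
    source = rng if rng is not None else random
    # lognormvariate(mu, sigma): exp(mu) is the median of the distribution.
    base = source.lognormvariate(math.log(median_s), sigma)
    # Adaptive part: 1x at 0% failures, up to 5x at 100% failures.
    slowdown = 1.0 + 4.0 * min(max(failure_rate, 0.0), 1.0)
    return base * slowdown
```

A log-normal draw is always positive and has a long right tail, so most pauses cluster near the median while the occasional long pause breaks up any regular rhythm.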

Enrichment, scoring and qualification

Enrichment and scoring is where raw data, gathered from several sources, turns into a priority list a salesperson can work top to bottom. Each lead gets a 0-100 score, and the logic is not an intuitive guess but a canonical weighted model, computed identically for every firm: score = 0.30 × revenue (log scale) + 0.30 × international percentage + 0.15 × platform + 0.15 × recency + 0.10 × VIES verified. The weights are not arbitrary: the two factors that best predict whether a firm genuinely needs cross-border shipping, size (revenue) and degree of international exposure, each get 30%, while the supporting signals (platform, recency, VIES) make up the rest.

  • Revenue scores ESTIMATED turnover on a logarithmic scale (1M RON → 0, 50M RON → 100). A log scale is the right choice because the gap between a 1M-RON firm and a 10M-RON one matters enormously for qualification, while the gap between 40M and 50M barely changes the decision; a linear model would have crushed all small firms into one indistinguishable band. When revenue is missing, the model does not discard the lead; it falls back to employee count as a proxy (1 → 0, 100 → about 60, 1000+ → 100), since team size correlates reasonably with activity volume.
  • International percentage is linear. Here the relationship is direct: the more a firm sells abroad, the more likely it has a recurring need for a cross-border operator, so a linear rise faithfully tracks the rising opportunity.
  • Platform gives 100 for Shopify/WooCommerce/Magento/BigCommerce, 60 for custom/WordPress/Wix, otherwise 30. The reasoning: a mature e-commerce platform signals a real flow of orders that must be shipped, while a custom or brochure site leaves more uncertainty, and the absence of any storefront signal drops the firm to the floor.
  • Recency is 70 if the firm is ANAF-active, plus 30 if an international payment processor (Stripe/PayPal) is present. The ANAF-active status, consumed mainly through ANAF's live 3-endpoint API, confirms the firm exists and operates now; the presence of an international processor is concrete evidence it already takes money from across borders.
  • VIES gives 100 if verified, 0 if not, 30 if unknown (benefit of the doubt). This component was added explicitly because VIES-verified firms are viable for intra-EU shipping: a valid VIES number means the firm can invoice intra-community without VAT, so it is operationally ready for EU deliveries. The score of 30 for "unknown" avoids penalising a firm merely because the check has not yet returned an answer.

Once the score is computed, leads are grouped into tiers that steer the sales effort: ≥80 hot, ≥60 warm, ≥40 cold, the rest frozen. That way the salesperson attacks the "hot" band first, where the probability of conversion per minute invested is highest.
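Put together, the weighted model and the tiers can be written out as below. The weights, anchors and thresholds come from the text; the function names, the input shapes, the interpolation between the employee-count anchors, and treating the two recency components as independent are assumptions of this sketch.

```python
import math
from typing import Optional

ECOM = {"shopify", "woocommerce", "magento", "bigcommerce"}
BASIC = {"custom", "wordpress", "wix"}


def clamp(x: float) -> float:
    return max(0.0, min(100.0, x))


def revenue_score(revenue_ron: Optional[float], employees: Optional[int]) -> float:
    if revenue_ron:
        # Log scale anchored at 1M RON -> 0 and 50M RON -> 100.
        return clamp(100 * math.log(revenue_ron / 1e6) / math.log(50))
    if employees and employees >= 1:
        # Assumed interpolation through the anchors 1 -> 0, 100 -> 60, 1000 -> 100.
        lg = math.log10(employees)
        return clamp(30 * lg if lg <= 2 else 60 + 40 * (lg - 2))
    return 0.0


def lead_score(revenue_ron, employees, intl_pct, platform,
               anaf_active, intl_processor, vies) -> float:
    platform_s = 100 if platform in ECOM else 60 if platform in BASIC else 30
    recency_s = (70 if anaf_active else 0) + (30 if intl_processor else 0)
    vies_s = {True: 100, False: 0, None: 30}[vies]  # None = not yet checked
    return (0.30 * revenue_score(revenue_ron, employees)
            + 0.30 * clamp(intl_pct)
            + 0.15 * platform_s + 0.15 * recency_s + 0.10 * vies_s)


def tier(score: float) -> str:
    if score >= 80:
        return "hot"
    if score >= 60:
        return "warm"
    return "cold" if score >= 40 else "frozen"
```

A firm with 50M RON turnover, 100% international share, a Shopify store, active ANAF status, an international processor and a verified VIES number hits every ceiling and lands in the hot tier.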

Score alone, however, is not enough: a firm can score high and still be unqualifiable. That is why the qualification gates act as a single source of truth that collects ALL conditions, without short-circuiting on the first rule met, and resolves the verdict by a clear precedence: REJECTED > PENDING > LEADS. This ordering means any reason to reject takes priority, and a lead never advances simply because it ticked one positive condition if a blocking reason exists in parallel. Concretely: I hard-reject struck-off firms (RADIATA) and non-RO legal forms (GMBH/LTD/INC/LLC/BV/OY), because a firm erased from the registry can no longer buy anything, and a foreign entity does not belong to the Romanian target market. I set PENDING when email or an administrator is missing. A lead I have no way to contact is not yet actionable, but it does not deserve to be thrown away either: the PENDING state sends it back to re-enrichment, so the pipeline tries again to fill the missing fields on the next pass. Finally, I mark LEADS only when usable contact data (an email, phone, or website) is present, so the salesperson has everything needed to call a real person, at a real firm, with valid fiscal identification. The pipeline thus contributed to turning a raw list into an ordered flow, in which every lead that reaches the phone has already been verified, scored and prepared.
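The collect-everything-then-resolve pattern looks roughly like this. The field names (`status`, `legal_form`, `email`, `administrator`, `phone`, `website`) are hypothetical, and the rules are a direct reading of the paragraph above rather than the pipeline's actual gate code.

```python
FOREIGN_FORMS = {"GMBH", "LTD", "INC", "LLC", "BV", "OY"}


def qualify(firm: dict) -> tuple[str, list[str]]:
    """Evaluate every gate, then resolve REJECTED > PENDING > LEADS."""
    rejected: list[str] = []
    pending: list[str] = []

    # No early return: every rule runs so the full reason list is kept.
    if firm.get("status") == "RADIATA":
        rejected.append("struck off")
    if str(firm.get("legal_form", "")).upper() in FOREIGN_FORMS:
        rejected.append("non-RO legal form")
    if not firm.get("email"):
        pending.append("email missing")
    if not firm.get("administrator"):
        pending.append("administrator missing")
    if not (firm.get("email") or firm.get("phone") or firm.get("website")):
        pending.append("no usable contact")

    if rejected:
        return "REJECTED", rejected + pending
    if pending:
        return "PENDING", pending
    return "LEADS", []
```

Keeping the full reason list, even for rejected firms, is what makes the verdict explainable after the fact.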

Deduplication and integrity

In a national-scale B2B lead-generation effort, the same company name shows up in dozens of variants: with and without "SRL," with or without diacritics, upper-cased or lower-cased, abbreviated or spelled out in full. If you rely on the name to identify a firm, you inevitably end up counting the same prospect three or four times, and a pipeline that duplicates companies quickly loses credibility with the sales team. That is why the identity key is never the name; it is the CUI (the Romanian unique fiscal identifier), normalized to 6-10 digits. Normalization means stripping the "RO" prefix, removing whitespace and padding, and reducing every written form down to the same digit string, so "RO XXXXXXXX," "XXXXXXXX," and "roXXXXXXXX" all become the same canonical key.

Deduplication runs on that key. When the same firm is seen across several sources (the live 3-endpoint ANAF API for official fiscal data, the live Google Maps and Google Business Profile sources for local presence and contact details, and OpenCorporates as a live corporate-data source), each one brings a different slice of truth. ANAF knows whether the firm is VAT-registered and currently active; Google Maps knows the physical address, opening hours, and reviews; OpenCorporates carries the legal structure and history. The merge logic does not arbitrarily pick a winner; it keeps the richest record. When two entries share the same normalized CUI, they collapse into a single canonical record that accumulates the most complete fields from each source, rather than overwriting one another and losing information.
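A field-by-field merge that accumulates rather than overwrites can be sketched as follows. It assumes the records already share one normalized CUI and arrive sorted by source priority, most authoritative first; both the helper names and that ordering convention are hypothetical.

```python
def is_filled(value) -> bool:
    """Treat None and empty containers/strings as missing."""
    return value not in (None, "", [], {})


def merge_records(records: list[dict]) -> dict:
    """Fold same-CUI records into one, keeping the first filled value per field.

    Assumes `records` is ordered by source priority, highest first.
    """
    merged: dict = {}
    for record in records:
        for field, value in record.items():
            if is_filled(value) and not is_filled(merged.get(field)):
                merged[field] = value
    return merged
```

An empty phone field from the official source therefore does not block a populated one from a lower-priority source, while a populated official value is never overwritten.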

That collapse has two properties I deliberately engineered for. First: zero duplicates. No matter how many sources see the same firm, the final output holds exactly one row per CUI, so the lead count reported is the real number of distinct companies, not an inflated figure. Second: no write races. Because the pipeline processes up to about 10,000 firms a day through a single BrightData proxy account tuned to that throughput, multiple flows can touch the same record almost simultaneously. If writes landed directly on the data file, two concurrent updates could step on each other and leave a corrupted, half-written record. The fix is the atomic temp-then-replace merge: the new version of the data is written in full to a temporary file, and only once that write completes does that file replace the old version through a single rename operation. The rename is atomic at the filesystem level, so a reader sees either the complete old state or the complete new state, never a partial mix.

On top of this structural integrity sits a layer of evidentiary integrity. Every record carries SHA-256 audit hashes, computed over the content that came in. Their job is not cryptographic security but traceability: I can prove exactly what data entered the system and from which source it came. If someone asks why a firm has a particular address or why it was flagged as active, the hash anchors that field to the precise snapshot it was pulled from. When a source refreshes and a field's value changes, the hash differs from the previous one, signaling that the record has changed, without my having to diff every field by hand. That turns a plain lead table into an auditable dataset, where each claim about a firm can be traced back to its origin.
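A minimal sketch of such an audit hash, assuming the hash is taken over a canonical JSON encoding of one source snapshot (the exact input the pipeline hashes is not stated, so this shape is an assumption):

```python
import hashlib
import json


def audit_hash(source: str, payload: dict) -> str:
    """SHA-256 over a canonical JSON encoding of one source snapshot."""
    canonical = json.dumps(
        {"source": source, "payload": payload},
        sort_keys=True,          # key order must not change the hash
        ensure_ascii=False,
        separators=(",", ":"),   # no whitespace variation
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Sorting keys and fixing separators matters: without a canonical encoding, two identical snapshots could serialize differently and look like a change.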

The starting point made these guarantees impossible. Before, firm verification was done in Excel, filled in by hand with Google searches and questions put to an AI chat, with roughly a one-week lag between the moment a firm came into view and the moment its data was confirmed. Excel enforced no identity key, so the same firm landed on different rows with nobody noticing; it offered no atomicity guarantee, so an interrupted save corrupted the file; and it had no audit hashes at all, so it was impossible to say after the fact where a given value had come from. The combination of the normalized CUI as the key, the atomic temp-then-replace merge, and the SHA-256 hashes replaces that fragility with a process where the same firm seen across several sources collapses into one canonical record, with no duplicates and no write races, a foundation the sales team can trust without checking every line by hand.

Running unattended (DuckDB / WAL / daemon)

If the earlier stages of the pipeline answer "what data do I gather and how do I qualify it", this stage answers the much harder question: "how do I keep this alive for days at a stretch without my hand on it". The starting point was a one-week-per-firm manual process, with enrichment done by hand from Google and a handful of AI chat assistants into a shared Excel sheet. To turn that into infrastructure that runs itself I needed three things: storage I can resume after any crash, a write model with no races, and a single process that orchestrates everything with no human supervisor.

Storage is DuckDB, with an atomic INSERT-OR-REPLACE on the CUI primary key. Choosing the CUI as the primary key is not cosmetic: it is the same normalized CUI used in deduplication, so the same firm seen across several sources in a later cycle does not duplicate, it simply overwrites the prior record with a richer variant. "Atomic" means the write either lands in full or not at all; if the process dies mid-batch, the database stays consistent. Idempotency is completed by resume-from-checkpoint over JSONL: progress state is serialized line by line, and on restart the daemon reads where it left off instead of starting over. That means a power cut or a Windows reboot costs a few minutes, not a day of work.
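The resume-from-checkpoint half of that design needs nothing beyond the standard library. In this sketch the progress file holds one JSON object per completed firm; the `{"cui": ...}` line shape and the function names are assumptions.

```python
import json
import os


def load_done(path: str) -> set[str]:
    """Read CUIs already completed; skip a torn line left by a crash."""
    done: set[str] = set()
    if not os.path.exists(path):
        return done
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            try:
                done.add(json.loads(line)["cui"])
            except (json.JSONDecodeError, KeyError, TypeError):
                continue  # unreadable line: that firm is simply re-run
    return done


def run(cuis: list[str], path: str, process) -> int:
    """Process only CUIs missing from the checkpoint; return how many ran."""
    done = load_done(path)
    ran = 0
    with open(path, "a", encoding="utf-8") as handle:
        for cui in cuis:
            if cui in done:
                continue
            process(cui)
            # Record progress only after the firm finished successfully.
            handle.write(json.dumps({"cui": cui}) + "\n")
            handle.flush()
            ran += 1
    return ran
```

Since the database write is a replace on the CUI key, re-running a firm whose checkpoint line was lost is harmless: it lands on the same row.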

On top of DuckDB sits a JSON write-ahead log with a SINGLE writer. The reason I insist on "single writer" is concrete: scraping runs with many parallel workers, and if two of them wrote to the same state file at once you would get write races: partial records, lost overwrites, corruption. By funneling every write through one log, I guarantee ordering and eliminate the entire class of concurrency bugs, without needing a central database server.
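The single-writer idea can be shown with a queue and one dedicated thread. This is an in-process illustration only: the real pipeline coordinates separate worker processes, and the class and method names here are hypothetical.

```python
import json
import queue
import threading


class SingleWriterLog:
    """Workers enqueue records; exactly one thread appends them to the log."""

    def __init__(self, path: str) -> None:
        self._queue: queue.Queue = queue.Queue()
        self._path = path
        self._thread = threading.Thread(target=self._drain, daemon=True)
        self._thread.start()

    def append(self, record: dict) -> None:
        self._queue.put(record)  # safe to call from any worker thread

    def _drain(self) -> None:
        with open(self._path, "a", encoding="utf-8") as handle:
            while True:
                record = self._queue.get()
                if record is None:  # sentinel: stop the writer
                    return
                handle.write(json.dumps(record) + "\n")
                handle.flush()

    def close(self) -> None:
        self._queue.put(None)
        self._thread.join()
```

Because only `_drain` ever touches the file, lines cannot interleave no matter how many workers call `append` at once.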

The most important operational step was consolidation: I took 19 separate Task-Scheduler jobs and melted them into ONE self-paced daemon. Before, each piece (an ANAF refresh here, an admins pull there, a proxy audit somewhere else) was an independent job, with its own schedule, its own risk of overlap, and its own way of failing silently. Nineteen clocks that did not talk to each other meant nineteen ways to step on one another. A single self-paced daemon plans its own cadence and knows what is running at every moment.

Around the main loop I put the safeguards that separate a script from infrastructure. Before each batch it checks free disk space, so it never starts a write it cannot finish. It rotates a snapshot every 12 batches, a regular restore point, so if one cycle damages the data I have somewhere to roll back to. It refreshes auth every 6 batches, ahead of session expiry, rather than after requests have already started failing. On 3 consecutive failures it enters a 1h backoff, capped at 6h, after which it HALTs, the logic being not to hammer a downed source (exactly the behavior that led to the 01/05/2026 ban), but to back off and, if nothing recovers, stop cleanly instead of burning quota for nothing.
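The failure policy in that paragraph (three consecutive failures, a 1h pause, a 6h cap, then a halt) can be modelled as a small state holder. How the pause grows between 1h and 6h is not stated; doubling is an assumption of this sketch, as are the class and attribute names.

```python
class Backoff:
    """3 consecutive failures -> pause; pauses grow to a cap, then halt.

    Assumption: the pause doubles (1h, 2h, 4h) until it reaches the 6h cap.
    """

    def __init__(self, threshold: int = 3, start_h: float = 1.0,
                 cap_h: float = 6.0) -> None:
        self.threshold = threshold
        self.start_h = start_h
        self.cap_h = cap_h
        self.failures = 0
        self.pause_h = 0.0   # last pause handed out
        self.halted = False

    def record(self, ok: bool) -> float:
        """Return hours to pause before the next batch (0.0 = carry on)."""
        if ok:
            self.failures, self.pause_h = 0, 0.0  # recovery resets everything
            return 0.0
        self.failures += 1
        if self.failures < self.threshold:
            return 0.0
        self.failures = 0
        if self.pause_h >= self.cap_h:
            self.halted = True  # the capped pause did not help: stop cleanly
            return 0.0
        self.pause_h = min(self.pause_h * 2, self.cap_h) if self.pause_h else self.start_h
        return self.pause_h
```

The caller sleeps for the returned number of hours and checks `halted` before starting the next batch.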

A PID single-instance lock stops two copies of the daemon from launching at once, a classic trap on automated restarts. Shutdown is graceful: it catches SIGINT, SIGTERM and a STOP file, so I can stop the cycle cleanly mid-run without leaving half-written records or orphan scrapers. A health probe runs continuously and, if it goes red, the daemon auto-halts rather than continuing to produce bad data. Separate from the daemon, a launch-guard prevents bulk refreshes from overlapping, keeping total ANAF load under 1 request/second, the limit the bilant endpoint enforces. Finally, failure isolation: each source is guarded individually, so if one goes down (ECRIS not responding, a proxy path blocked), the rest carry on, and the affected firm goes to PENDING to be re-enriched, not lost. That, in the end, is the definition of running unattended.

Once a firm is marked LEADS, the pipeline writes it straight into the FACS CRM: a REST POST to /api/v1/Lead (Basic auth) carrying the account name, CUI, status, address, website and every contact (name, role, emails, phones). Two dedup layers guard the write: a local CUI filter beforehand and server-side duplicate detection (409/422) on the POST itself. A dry-run mode exists for testing. Qualified leads land in the CRM ready to call, with no manual typing.

Outcome and business impact

The system enriched 8,307 firms at 99% coverage and pushes roughly 1,000 pre-qualified leads per day into the CRM, at a throughput of about 10,000 firms/day on a single proxy account. I went from a manual Excel queue, where each firm was checked one at a time through Google searches and questions put to an AI chat, with roughly a one-week lag between a firm entering the list and being ready to contact, to an engine that grinds through thousands of firms a day, unattended.

The substantive difference is not raw speed but the removal of the human touch from the processing loop. In the old way, an operator opened the spreadsheet, copied a company name, searched it on Google, read the results, asked an AI chat to confirm status, and only then noted a verdict. Every step was a human decision, so every firm cost minutes and every day meant a few dozen firms. The new system replaces that manual judgement with authoritative sources queried programmatically: the live ANAF API, hit through its three endpoints for fiscal data, VAT, and status, plus Google Maps and Google Business Profile for real local presence and OpenCorporates for legal-structure verification. Because these sources are queried directly rather than read off a page by a person, the same verdict that used to take minutes now arrives in milliseconds, and the 99% coverage figure is exactly that statement: almost no firm now falls out of enrichment for lack of data.

The throughput of about 10,000 firms/day rests on one deliberate engineering decision: a single BrightData proxy account, tuned rather than multiplied. The obvious temptation would have been to open several accounts to force more requests in parallel, but that would have meant higher cost, more coordination complexity, and greater blocking risk. Instead, the single account was calibrated (request pacing, rotation, back-off) so that it sustains the target volume steadily without tripping anti-bot defenses. It is a do-more-with-less choice: one predictable egress point, easy to monitor and adjust, rather than a fleet of accounts that are hard to keep in sync.

I frame the business impact honestly: the pipeline CONTRIBUTED to these results and enabled them; they were not caused by software alone. The 28% inactive-legacy reactivation became possible because enrichment surfaced firms that were already in the database but looked dead: a human operator never had time to re-check thousands of old accounts, whereas the engine can re-score them all in a single day and flag the ones that have come back to life. The +45% quarter-on-quarter cross-border volume and the 40% shorter sales cycle come from the same mechanism: when the sales team receives roughly 1,000 pre-qualified leads per day, already filtered by need and reachability, they talk to the right firms sooner and waste no time on invalid contacts. The 94% account retention and the 14% lift in regional gross margin are business outcomes in which the pipeline is one of several factors, alongside product, price, and the commercial relationship, not the sole cause.

For me, the real story is not the numbers but the fact that a one-week-per-firm manual process became infrastructure that runs itself, day and night. When a task depends on a person opening a file and searching by hand, it stops when that person goes home; when it becomes an engine with authoritative sources and a finely tuned proxy, it keeps delivering leads at three in the morning. That is the shift that matters: I did not make a manual search faster, I removed it entirely and replaced it with a system that does not tire, does not skip a step, and never needs a break.

Source code: 3 anonymized ANAF/ONRC scripts (GitHub)

Timeline

  • 21/04/2026 First bulk sweep: 1,629 firms processed, 1,477 net-new into the database.
  • 23/04/2026 First production runs: parallel multi-source enrichment behind a quality gate.
  • 24/04/2026 ANAF field extension (5+ fields/firm); estimated revenue added to the score.
  • 30/04/2026 A key source goes Cloudflare-protected on every path; the service is paused.
  • 01/05/2026 Ban after 273 requests in 5.5 hours → a full rate-limiting redesign.
  • 04/05/2026 The full company universe downloaded locally (4 ANAF bulks, 1.4 GB) into DuckDB.
  • 13/05/2026 19 scheduler jobs consolidated into a single self-paced daemon.
  • 15/05/2026 Cloudflare bypassed via a stealth browser; the source is back online.
  • 16/05/2026 10,000 firms/day throughput reached on a single proxy account.
  • 17/05/2026 JSON write-ahead log (write-then-ingest) + a single DuckDB writer process.
  • 24/05/2026 A launch-guard that prevents overlapping bulk-refresh runs.