How to process large datasets?

I need to create a workflow that processes about 250k contact entries. (I’d prefer to only process a changeset, but that doesn’t seem to be possible currently.) The workflow needs to run the following tasks:

  • loop through a webservice that returns a max of 1000 records (id + email)
  • filter out unnecessary records (missing or wrong email)
  • loop through the rest of the 1000 records and make a detail request for each
  • filter out further records that don’t meet the quality criteria
  • lookup existing records by foreign key in 1CRM and do an update/insert (these steps could be extended, as we can extend the API)
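To make the intended flow concrete, here is a minimal sketch of that pipeline in plain Python. The paging size, `fetch_details`, `upsert`, and the `quality_ok` flag are all hypothetical stand-ins for the real APIs, not actual endpoints:

```python
import re

# Very loose email sanity check; the real "wrong email" filter may differ.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def fetch_page(page, source):
    """Simulate the list endpoint: a fixed max of records per page (3 here, 1000 in reality)."""
    size = 3
    return source[page * size:(page + 1) * size]

def valid_email(record):
    return bool(record.get("email")) and EMAIL_RE.match(record["email"]) is not None

def process_all(source, fetch_details, upsert):
    """Paginate, filter, enrich with a detail request, filter again, upsert.

    Returns the number of records that were upserted.
    """
    page, upserted = 0, 0
    while True:
        batch = fetch_page(page, source)
        if not batch:           # empty page -> we paged past the end
            break
        for rec in filter(valid_email, batch):
            detail = fetch_details(rec["id"])
            if detail.get("quality_ok"):   # second, quality-based filter
                upsert(detail)
                upserted += 1
        page += 1
    return upserted
```

In the real workflow each of these steps would be a node (or a sub-workflow, as below), but the data flow is the same.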

An extra challenge is the rate limit of 1000 requests/minute for the first API (at least it returns a header with the remaining capacity).
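Since the API reports its remaining capacity, the throttling could be header-driven rather than a fixed delay. A small sketch, assuming the response exposes a "remaining requests" count and the seconds until the window resets (the exact header names depend on the API):

```python
def throttle(remaining, reset_seconds, min_headroom=5):
    """Return how long to sleep before the next request.

    remaining:     value of the API's remaining-capacity header (assumption)
    reset_seconds: seconds until the rate-limit window resets (assumption)
    min_headroom:  keep a few requests in reserve for retries
    """
    if remaining > min_headroom:
        return 0.0          # plenty of budget left, fire away
    return float(reset_seconds)  # budget exhausted, wait for the next window
```

The caller would read the two values from each response and `time.sleep()` the returned amount before the next request.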

So at a small scale I have this up and running; currently the rate limit and the triggerless design are the main issues.

My idea was to split the whole process into several small workflows:

  1. a counter with a delay that calls wf #2 with an integer 1-250 (or more, to lower the number in #2 a bit)
  2. the 1000 records request, that uses #1 for pagination, handles the first filter and forwards smaller chunks of data to #3
  3. as the detail lookup is able to process multiple records at once, I would use this to max out the rate limit, filter results and forward them to #4
  4. (not sure, if I’ll put this directly into #3 as it will make data handling easier) lookup 1CRM record and do the upsert
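Workflow #1 is essentially a paced dispatcher. A minimal sketch of that counter logic, with `call_workflow2` standing in for the (hypothetical) trigger call into workflow #2:

```python
import time

def dispatch_pages(total_records, page_size, call_workflow2, delay=0.0, sleep=time.sleep):
    """Workflow #1: hand page numbers 1..N to workflow #2, pausing between calls.

    call_workflow2 is a placeholder for however #2 gets invoked (webhook, etc.).
    Returns the number of pages dispatched.
    """
    pages = -(-total_records // page_size)   # ceiling division
    for page in range(1, pages + 1):
        call_workflow2(page)
        if delay and page < pages:
            sleep(delay)                     # pace the downstream requests
    return pages
```

With 250k records and 1000 per page this dispatches pages 1–250; a larger `page_size` argument into #2 would lower that count as mentioned above.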

This way I expect error tracking to be easier, and the smaller workflows with external triggers should be simpler to debug. As an alternative to #1, I could use a workflow with a cron trigger that stores the counter in the global workflow context.
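The cron alternative boils down to a persisted, wrapping counter. A sketch, with a plain dict standing in for whatever the workflow engine offers as global/static context:

```python
def next_page(store, key="page", max_page=250):
    """Each cron run: read the last page number from the shared store,
    advance it, wrap around after max_page, and persist it again.

    `store` is a stand-in for the workflow's global context (assumption).
    """
    page = store.get(key, 0) + 1
    if page > max_page:
        page = 1            # start the next full pass over the dataset
    store[key] = page
    return page
```

One thing to watch with this variant: if a run fails mid-page, the counter has already advanced, so that page is skipped until the next full pass.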

Is this a feasible way to solve this? Are there limitations that need to be handled? Any tips?