How to iterate huge query results in Magento 2
Nov 12, 2018
Working with Magento 2 we often use collections to fetch and work with data from our database. For the most part these work perfectly fine, but when it comes to fetching and iterating millions of records in a single process, memory usage quickly becomes a problem. In my experience the overhead of using a collection is too high when dealing with result sets of this magnitude, so we need to look for an alternative. In this article I’m going to talk about how to handle huge result sets in Magento without crippling your PHP process.
Querying and loading huge results
Magento 2 ships with a helper (`Magento\Framework\Model\ResourceModel\Iterator`) that enables us to iterate huge database result sets by applying callbacks to each row in the result. This is perfect when you have millions of records that you need to process without loading them all into memory. Let’s look at this in a real example:
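A minimal sketch of what this can look like. The module namespace, class name, and table name (`huge_table`) are placeholder assumptions; the `Iterator` helper itself, its `walk()` method, and the `row` key in the callback arguments are part of Magento 2.

```php
<?php
declare(strict_types=1);

namespace Vendor\Module\Model;

use Magento\Framework\App\ResourceConnection;
use Magento\Framework\Model\ResourceModel\IteratorFactory;

class HugeTableProcessor
{
    /** @var ResourceConnection */
    private $resourceConnection;

    /** @var IteratorFactory */
    private $iteratorFactory;

    public function __construct(
        ResourceConnection $resourceConnection,
        IteratorFactory $iteratorFactory
    ) {
        $this->resourceConnection = $resourceConnection;
        $this->iteratorFactory = $iteratorFactory;
    }

    public function process(): void
    {
        $connection = $this->resourceConnection->getConnection();

        // A plain SELECT with no OFFSET/LIMIT; we want the entire result set
        $select = $connection->select()->from('huge_table');

        // Walk the result, invoking our callback once per row; the iterator
        // fetches one row at a time rather than loading the whole result
        $this->iteratorFactory->create()->walk(
            $select,
            [[$this, 'processRow']],
            [],
            $connection
        );
    }

    public function processRow(array $args): void
    {
        $row = $args['row'];
        // Do something with the single row currently held in memory
    }
}
```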
A few things are happening here, so let’s look at them:

- We are injecting an `IteratorFactory`, which is an auto-generated factory from Magento’s dependency injection container
- We are creating a simple `SELECT` query to fetch data from the huge table, intentionally omitting the `OFFSET`/`LIMIT` clauses (more on this below)
- We are creating an iterator and walking through each result in the query by passing a single callback function (notice that we could provide many callbacks here if we wanted to)
The intention here is to only ever load a single row into memory at any time (unless you assign it to a variable outside the scope of the callback function passed to `$iterator->walk()`). If we achieve this then it does not matter how many rows we need to process: our memory consumption should not increase.
Why not just use `OFFSET`/`LIMIT`?
The first thing that might stand out here is that we are not fetching batches of results using MySQL’s `OFFSET` and `LIMIT` clauses. This is because that approach is incredibly wasteful at the database level. If we issued a query such as `SELECT name FROM users LIMIT 1000 OFFSET 10000`, we are basically telling the database to fetch 11000 records and throw the first 10000 away, and the cost grows with every subsequent batch. By avoiding this approach and only executing a single query in the database we are being far more efficient with our computing resources.
PHP’s MySQL memory management
Our solution above looks perfectly OK. We execute a single query and only ever load one row into memory. However, there is a problem with the way PHP handles MySQL query results by default. When we execute that single query PHP tells MySQL: “Hey, give me this entire query result and I will store it in my PHP memory and you can forget about it”, and it does this even though we haven’t asked for anything to be loaded into memory other than a single row. Even though PHP does this for very good reasons (to allow us to do result counts in memory etc.), it is completely the opposite of what we want when dealing with result sets of this magnitude.
“Queries are using the buffered mode by default. This means that query results are immediately transferred from the MySQL Server to PHP and then are kept in the memory of the PHP process.” http://php.net/manual/en/mysqlinfo.concepts.buffering.php
This concept is called query buffering, and luckily there is a way to turn it off in situations like this. If we turn off query buffering we end up with an unbuffered query result which is held on our database server and returned to us row-by-row or however we want to fetch it in our PHP process.
“Unbuffered MySQL queries execute the query and then return a resource while the data is still waiting on the MySQL server for being fetched. This uses less memory on the PHP-side, but can increase the load on the server.” http://php.net/manual/en/mysqlinfo.concepts.buffering.php
This is exactly what we want when dealing with a huge result set: a small, predictable memory footprint in our PHP process (which is usually running inside a resource restricted boundary, e.g. a Docker container). So how do we get unbuffered query results in Magento? Let’s take a look…
Creating an unbuffered query connection
If we have a look at the official PHP documentation on query buffering, we can see that there is an attribute that can be set on the `PDO` object after connecting:
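The attribute in question is `PDO::MYSQL_ATTR_USE_BUFFERED_QUERY`. Outside of Magento, disabling it on a raw `PDO` connection looks like this (the DSN and credentials are placeholders):

```php
<?php
// Plain PDO outside of Magento; host, database, and credentials are placeholders
$pdo = new PDO('mysql:host=127.0.0.1;dbname=magento', 'db_user', 'db_pass');

// Switch this connection to unbuffered mode: results now stay on the
// MySQL server and are fetched row-by-row on demand
$pdo->setAttribute(PDO::MYSQL_ATTR_USE_BUFFERED_QUERY, false);
```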
In Magento we don’t readily have access to the underlying `PDO` object, so we must use an alternative approach. Instead, we must create a new instance of `Magento\Framework\DB\Adapter\AdapterInterface` (which is the interface that Magento DB connections implement) that has this attribute set on it. The easiest way to do this is to fetch the current connection and then re-create it so that it returns unbuffered queries when used. Let’s have a look at an example of this factory:
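One possible sketch of such a factory, following the steps described below. The class name and namespace are assumptions; `ResourceConnection`, `ConnectionFactory`, and the `use_buffered_query` config key are real Magento 2 pieces, though note the caveat on `getConfig()` in the comments.

```php
<?php
declare(strict_types=1);

namespace Vendor\Module\Model\ResourceModel;

use Magento\Framework\App\ResourceConnection;
use Magento\Framework\App\ResourceConnection\ConnectionFactory;
use Magento\Framework\DB\Adapter\AdapterInterface;

class UnbufferedConnectionFactory
{
    /** @var ResourceConnection */
    private $resourceConnection;

    /** @var ConnectionFactory */
    private $connectionFactory;

    public function __construct(
        ResourceConnection $resourceConnection,
        ConnectionFactory $connectionFactory
    ) {
        $this->resourceConnection = $resourceConnection;
        $this->connectionFactory = $connectionFactory;
    }

    public function factory(
        string $connectionName = ResourceConnection::DEFAULT_CONNECTION
    ): AdapterInterface {
        // Fetch the existing (buffered) connection so we can read its config.
        // Note: getConfig() lives on the concrete Pdo\Mysql adapter rather
        // than on AdapterInterface, so this assumes the default MySQL adapter.
        $connection = $this->resourceConnection->getConnection($connectionName);
        $config = $connection->getConfig();

        // Disable query buffering so results are streamed from the server
        $config['use_buffered_query'] = false;

        // Create a fresh adapter of the same type with the modified config
        return $this->connectionFactory->create($config);
    }
}
```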
There are a few things happening here, so let’s break it down:

- We inject `ResourceConnection` into our factory, which gives us access to existing connections
- In our `factory()` method we take a connection name and fetch the actual connection (if no name is provided then the default connection is used)
- We modify the configuration data for the connection to set `use_buffered_query` to `false`
- We return a new instance of the same connection type with this new configuration
- When the new connection is used it will return unbuffered query results
So now, if we want to fetch an unbuffered query result in our example from the start of this article, it would look something like this:
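A sketch of the updated processor, assuming the hypothetical factory class and placeholder names from before (`UnbufferedConnectionFactory`, `Vendor\Module`, `huge_table`):

```php
<?php
declare(strict_types=1);

namespace Vendor\Module\Model;

use Magento\Framework\Model\ResourceModel\IteratorFactory;
use Vendor\Module\Model\ResourceModel\UnbufferedConnectionFactory;

class HugeTableProcessor
{
    /** @var UnbufferedConnectionFactory */
    private $unbufferedConnectionFactory;

    /** @var IteratorFactory */
    private $iteratorFactory;

    public function __construct(
        UnbufferedConnectionFactory $unbufferedConnectionFactory,
        IteratorFactory $iteratorFactory
    ) {
        $this->unbufferedConnectionFactory = $unbufferedConnectionFactory;
        $this->iteratorFactory = $iteratorFactory;
    }

    public function process(): void
    {
        // Unbuffered connection: rows stay on the MySQL server until fetched
        $connection = $this->unbufferedConnectionFactory->factory();

        $select = $connection->select()->from('huge_table');

        $this->iteratorFactory->create()->walk(
            $select,
            [[$this, 'processRow']],
            [],
            $connection
        );
    }

    public function processRow(array $args): void
    {
        $row = $args['row'];
        // Process the single row currently held in memory; remember not to
        // issue other queries over $connection while the walk is in progress
    }
}
```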
Notice how we are now no longer using `ResourceConnection` directly in our result iterator; instead we use our new factory, which gives us a connection that returns unbuffered query results. One important thing to note here is that while an unbuffered query result is being fetched, no other queries can be issued over the same connection until the entire result has been consumed. This means that if we wanted to issue, for example, an `UPDATE` or `INSERT` query for each row in our result, we would need to fetch a normal connection from `ResourceConnection` and issue those queries over that instead.