Understanding Generators in PHP: Benefits, Use Cases, and Common Pitfalls
Posted on April 15, 2024 (Last modified on May 5, 2024) • 10 min read • 2,082 words
Explore the concept of generators in PHP, including practical applications and pitfalls. Learn how to effectively use the 'yield' keyword for memory-efficient data processing in PHP.
Generators are not a concept specific to PHP. In programming, a generator produces values one at a time instead of producing them all at once. The word yield simply means to produce something.
In PHP, a function/method is called a generator function/method when it contains yield followed by the value being yielded, or when it delegates to a Generator, array or Traversable using the yield from keywords.
The classic example presented when a PHP Generator is mentioned is the following:
function generateNumbers(int $limit): \Generator
{
    $counter = 0;
    while (++$counter < $limit) {
        yield $counter;
    }
}
// usage
foreach (generateNumbers(10000000) as $number) {
    echo $number;
}
Yes, there is no memory error here, because only one value is produced at a time. This is just a rudimentary example.
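The yield from form mentioned earlier is easier to see in a small sketch. The function names below are hypothetical and exist only for this illustration; the sketch delegates to another Generator, to an array and to a Traversable (an ArrayIterator) in turn.

```php
<?php
declare(strict_types=1);

// Hypothetical helper, used only to have a Generator to delegate to.
function innerNumbers(): \Generator
{
    yield 1;
    yield 2;
}

function allNumbers(): \Generator
{
    yield 0;                            // plain yield of a single value
    yield from innerNumbers();          // delegate to another Generator
    yield from [3, 4];                  // delegate to an array
    yield from new \ArrayIterator([5]); // delegate to a Traversable
}

// Collect only the values; the keys are ignored on purpose.
$values = [];
foreach (allNumbers() as $value) {
    $values[] = $value;
}

echo implode(',', $values), PHP_EOL; // prints 0,1,2,3,4,5
```

Note that yield from keeps the keys of each delegated source as they are, so the same key can appear more than once during a single iteration.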
Consider the following widely used example.
Read a large CSV file as part of ETL (Extract Transform Load)
There are a couple of ways to do this.
First things first, we need a large CSV file. I decided to use the “hours watched” data for all Netflix subscribers.
Depending on the configuration, one might already run out of memory when reading the whole CSV file into one humongous array that holds everything, so that approach can be crossed out. The good old obvious way is to read the file as a stream and process each line. Example code is below.
function parseAndImport(string $filePath, string $separator = ','): void
{
    if (!file_exists($filePath)) {
        throw new \RuntimeException('FILE_NOT_FOUND');
    }
    $file = fopen($filePath, 'r');
    if ($file === false) {
        throw new \RuntimeException('FILE_OPEN_ERROR');
    }
    $headers = fgetcsv($file, 4096, $separator);
    if ($headers === false) {
        fclose($file);
        throw new \RuntimeException('FILE_INVALID_FORMAT');
    }
    $headers = array_map('trim', $headers);
    $headersCount = count($headers);
    while (!feof($file)) {
        $row = fgetcsv($file, 4096, $separator);
        if ($row === false) {
            continue;
        }
        // skip rows that do not match the header layout
        if (count($row) !== $headersCount) {
            continue;
        }
        $row = array_map('trim', $row);
        // there would be a few more steps here, like further cleaning, validating etc.
        // everything must be done within this loop
    }
    fclose($file);
}
// example use -
parseAndImport(__DIR__.'/data/sample.csv');
The above solution works, but we are cramming everything into one place because we do not have the flexibility to pass the streamed data points around. Of course, we could have several functions that perform individual tasks like validation, notification etc., and these functions could take a single data point and perform those actions. But why should a function that parses CSV take care of all that? Also notice the “And” in the function name. When there is an “And” in a function/method name, it might be doing more than one thing and breaking the S of SOLID. Now is the perfect time to examine the second solution.
Consider using a Generator for the above scenario. The code will then look like the one below. The file is still read line by line, but the difference is that each value is “yielded” back to the caller. The function returns an object which can be passed around.
function parse(string $filePath, string $separator = ','): \Generator
{
    if (!file_exists($filePath)) {
        throw new \RuntimeException('FILE_NOT_FOUND');
    }
    $file = fopen($filePath, 'r');
    if ($file === false) {
        throw new \RuntimeException('FILE_OPEN_ERROR');
    }
    try {
        $headers = fgetcsv($file, 4096, $separator);
        if ($headers === false) {
            throw new \RuntimeException('FILE_INVALID_FORMAT');
        }
        $headers = array_map('trim', $headers);
        $headersCount = count($headers);
        $validRows = 0;
        while (!feof($file)) {
            $row = fgetcsv($file, 4096, $separator);
            if ($row === false) {
                continue;
            }
            // skip rows that do not match the header layout
            if (count($row) !== $headersCount) {
                continue;
            }
            $row = array_map('trim', $row);
            $validRows++;
            yield array_combine($headers, $row);
        }
        // available to the caller through Generator::getReturn()
        return $validRows;
    } finally {
        // runs when the generator finishes or is destroyed early
        fclose($file);
    }
}
// example usage
$readData = parse(__DIR__.'/data/sample.csv');
// $readData can be passed to some other service that will loop through it and perform the necessary actions
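As a rough sketch of what "passing it around" can look like: the code below uses a compact stand-in for the parser (readRows) and a separate consumer (sumHours). Both names, the column names and the sample data are assumptions made up for this illustration. It also shows that the value returned by the generator function can be read with getReturn() once the iteration has finished.

```php
<?php
declare(strict_types=1);

// Compact, hypothetical stand-in for a CSV parsing generator.
function readRows(string $filePath): \Generator
{
    $file = fopen($filePath, 'r');
    if ($file === false) {
        throw new \RuntimeException('FILE_OPEN_ERROR');
    }
    try {
        $headers = fgetcsv($file, 4096, ',', '"', '\\');
        if ($headers === false) {
            throw new \RuntimeException('FILE_INVALID_FORMAT');
        }
        $validRows = 0;
        while (($row = fgetcsv($file, 4096, ',', '"', '\\')) !== false) {
            if (count($row) !== count($headers)) {
                continue; // skip rows that do not match the header layout
            }
            $validRows++;
            yield array_combine($headers, $row);
        }
        return $validRows;
    } finally {
        fclose($file);
    }
}

// The consumer only knows it receives something iterable.
function sumHours(iterable $rows): int
{
    $total = 0;
    foreach ($rows as $row) {
        $total += (int) $row['hours'];
    }
    return $total;
}

// Sample data written to a temporary file so the sketch is self-contained.
$path = tempnam(sys_get_temp_dir(), 'csv');
file_put_contents($path, "title,hours\nShow A,10\nbroken line\nShow B,32\n");

$rows = readRows($path);
$totalHours = sumHours($rows);
$validRows = $rows->getReturn(); // only valid after the generator has finished
unlink($path);

echo "rows: {$validRows}, hours: {$totalHours}", PHP_EOL; // prints rows: 2, hours: 42
```

The parser knows nothing about summing, and the consumer knows nothing about files, which is the separation the single-function version could not offer.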
So, a Generator does not bring any memory optimization in itself; it is a simpler way of writing an Iterator, and both can be iterated. One notable difference is that an Iterator can be rewound, while a Generator is forward-only once generation has started, for example with a foreach. Calling rewind after the generator has advanced past its first yield throws an exception. I have seen numerous blogs that show how to read a CSV file and claim that a Generator is the only way to read such a large file without causing a memory error. That is not entirely true. The power of a Generator comes from the generator object, which can be passed around just like an Iterator object.
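The forward-only behaviour can be checked with a small sketch. The letters function is a hypothetical example; the exact wording of the exception message is left to PHP and is only printed, not relied upon.

```php
<?php
declare(strict_types=1);

function letters(): \Generator
{
    yield 'a';
    yield 'b';
    yield 'c';
}

$generator = letters();

// Before any progress, rewind() is allowed: it runs the body up to the first yield.
$generator->rewind();
$first = $generator->current();

// Advance past the first yield.
$generator->next();
$second = $generator->current();

// From now on the generator cannot go back.
$message = null;
try {
    $generator->rewind();
} catch (\Exception $exception) {
    $message = $exception->getMessage();
}

// A regular Iterator, by contrast, can be rewound at any time.
$iterator = new \ArrayIterator(['a', 'b', 'c']);
$iterator->next();
$iterator->rewind();
$afterRewind = $iterator->current();

echo $first, ' ', $second, ' ', $afterRewind, PHP_EOL; // prints a b a
echo 'rewind failed: ', $message, PHP_EOL;
```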
Now let's explore some use cases.
Generator is iterable
Another aspect worth mentioning is that a Generator is iterable. Consider a scenario where we need to gather data from various sources: some data comes from a file, some from a database, some is static, and so on. We want an interface whose method returns an iterable, so an implementation can return an array or a Traversable. iterable is a type that accepts both array and Traversable. This allows polymorphic behaviour: the calling code works the same way as long as the method returns an array, a Traversable or a Generator.
<?php
declare(strict_types = 1);

interface DataFetcherInterface
{
    public function fetch(): iterable;
}

class DbDataFetcher implements DataFetcherInterface
{
    public function fetch(): iterable
    {
        // get data from database and yield
        yield 'data from DB';
    }
}

class FileDataFetcher implements DataFetcherInterface
{
    public function fetch(): iterable
    {
        // get data from file and yield
        yield 'data from file';
    }
}

class StaticDataFetcher implements DataFetcherInterface
{
    public function fetch(): iterable
    {
        // static data is returned as a plain array
        return ['static data'];
    }
}

class DataProcessService
{
    public function process(): void
    {
        // these fetchers could be injected into this service, but for the example let's create an array with all the fetchers
        $fetchers = [
            new DbDataFetcher(),
            new FileDataFetcher(),
            new StaticDataFetcher(),
        ];
        foreach ($fetchers as $fetcher) {
            $this->doProcess($fetcher);
        }
    }

    private function doProcess(DataFetcherInterface $fetcher): void
    {
        foreach ($fetcher->fetch() as $data) {
            // do something interesting with this data
        }
    }
}

// usage
$service = new DataProcessService();
$service->process();
Consider a remote API that returns paginated data, e.g. 1000 rows per page. Each data point from the API response needs some further processing, which can include consolidation of the data etc., before it is stored. The API client can then be written with a generator method that fetches each new page and yields the fetched data.
<?php
class ApiClient
{
    public function fetch(): Generator
    {
        $currentPage = 1;
        while ($currentPage !== null) {
            $response = $this->callApi($currentPage);
            // response validation could be here
            $currentPage = $response['next_page'];
            // yield the rows of the page, not the response envelope
            yield from $response['data'];
        }
    }

    private function callApi(int $page): array
    {
        // mock api call: page 1 points to page 2, which is the last page
        if ($page === 1) {
            return [
                'next_page' => 2,
                'data' => [['id' => 1, 'description' => 'This is some description']],
            ];
        }
        return [
            'next_page' => null,
            'data' => [['id' => 2, 'description' => 'This is some other description']],
        ];
    }
}

$client = new ApiClient();
foreach ($client->fetch() as $data) {
    // process data further
}
If you are not familiar with data providers, please read up on them first. In short:
In PHPUnit, a data provider efficiently supplies varied data sets to a test method for reuse, enhancing modular testing. The @dataProvider annotation indicates the method that provides the data for the test method.
A data provider method has a few requirements. The second one is of interest here: the data provider method can return an iterable, that is, an array, a Generator or another Traversable.
Consider the following test:
class MyTest extends PHPUnit\Framework\TestCase
{
    /**
     * @test
     * @dataProvider performOperationDataProvider
     */
    public function testSomething(int $input, int $expectedResult): void
    {
        $this->assertSame($expectedResult, $this->performOperation($input));
    }

    public static function performOperationDataProvider(): iterable
    {
        return [
            'case with input 5' => [
                'input' => 5,
                'expectedResult' => 100,
            ],
            'case with input 6' => [
                'input' => 8,
                'expectedResult' => 200,
            ],
            'case double 1' => [
                'input' => 100,
                'expectedResult' => 200,
            ],
            'case double 2' => [
                'input' => 50,
                'expectedResult' => 100,
            ],
        ];
    }

    private function performOperation(int $value): int
    {
        if ($value === 5) {
            return 100;
        }
        if ($value === 8) {
            return 200;
        }
        return $value * 2;
    }
}
Running the test with the --testdox formatter:
~/user_me ❯ phpunit MyTest.php --testdox
PHPUnit 9.5.9 by Sebastian Bergmann and contributors.
My Test
✔ Something with data set "case with input 5"
✔ Something with data set "case with input 6"
✔ Something with data set "case double 1"
✔ Something with data set "case double 2"
Time: 00:00.008, Memory: 22.25 MB
OK (4 tests, 4 assertions)
The keys of the returned iterable are used to name the cases. If more data needs to be provided to the test method, the array returned by performOperationDataProvider can start to become unreadable. Using a Generator can help to make it more readable.
public static function performOperationDataProvider(): iterable
{
    yield 'case with input 5' => [
        'input' => 5,
        'expectedResult' => 100
    ];
    yield 'case with input 6' => [
        'input' => 8,
        'expectedResult' => 200
    ];
    yield 'case double 1' => [
        'input' => 100,
        'expectedResult' => 200
    ];
    yield 'case double 2' => [
        'input' => 50,
        'expectedResult' => 100
    ];
}
If something needs to be added between the first case and the second case, it is more transparent than having everything packed into a single array that is created at the beginning. Opinions may differ on this, but on a personal level this looks a lot cleaner.
If the only tool you have is a hammer, it is tempting to treat everything as if it were a nail - Maslow
Everything is a matter of balance. A Generator can provide memory optimization, but some speed of execution may be sacrificed for it. The choice is speed versus memory use. If both are of the essence, the solution should be studied carefully. There might be a situation where the software does not cause a memory error but takes so long that it frustrates the end user anyway.
But in cases where the whole iteration is not needed, the benefit can be in terms of both speed and memory.
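A minimal sketch of that last point, assuming a hypothetical endless generator of square numbers: because values are produced on demand, the consumer can stop as soon as it has what it needs, and nothing beyond that point is ever computed or stored.

```php
<?php
declare(strict_types=1);

// Hypothetical endless source: key is n, value is n squared.
function squares(): \Generator
{
    $n = 0;
    while (true) {
        $n++;
        yield $n => $n * $n;
    }
}

// Returns [n, square] for the first square above the threshold, or null if the source runs out.
function firstSquareAbove(iterable $squares, int $threshold): ?array
{
    foreach ($squares as $n => $square) {
        if ($square > $threshold) {
            return [$n, $square]; // leaving the loop stops the generator here
        }
    }
    return null;
}

[$produced, $square] = firstSquareAbove(squares(), 1000);

echo "{$square} found after {$produced} values", PHP_EOL; // prints 1024 found after 32 values
```

Building the same data as an array first would not even be possible here, since the source never ends.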
iterator_to_array
This function can convert a Generator to an array. Cool! But there is a catch: it can also cause problems and become a source of bugs that are hard to locate.
Take the following example:
<?php
function functionA(): iterable
{
    yield from ['A' => 1, 'B' => 2, 'C' => 2];
    yield from functionB();
}

function functionB(): iterable
{
    yield from ['A' => 4, 'B' => 5, 'C' => 6];
}

$result = functionA();

function loop(iterable $iterable)
{
    // for each
    foreach ($iterable as $key => $value) {
        echo $key . '=' . $value . PHP_EOL;
    }
}

echo 'Result converting Generator to array - with preserve key true' . PHP_EOL;
loop(iterator_to_array(functionA()));
echo 'Result without converting Generator to array' . PHP_EOL;
loop(functionA());
Result
Result converting Generator to array - with preserve key true (default)
A=4
B=5
C=6
Result without converting Generator to array
A=1
B=2
C=2
A=4
B=5
C=6
As you can see, the first result is not what is expected: the first three values are missing. The correct data is obtained only when actually iterating the Generator object. A consumer of a Generator cannot always know how the values are produced. So, it is better not to rely on iterator_to_array with its default arguments, and rather to build the array by iterating. If there are duplicate keys, iterator_to_array returns only the last key, value pair for each key.
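Two ways around the duplicate-key problem can be sketched as follows. The function names first and second are hypothetical. The sketch relies on the second parameter of iterator_to_array (preserve_keys), which re-indexes the values when set to false, and on a manual loop for the case where the keys themselves matter.

```php
<?php
declare(strict_types=1);

function first(): \Generator
{
    yield from ['A' => 1, 'B' => 2];
    yield from second();
}

function second(): \Generator
{
    yield from ['A' => 3, 'B' => 4];
}

// Default behaviour: later pairs overwrite earlier ones with the same key.
$withKeys = iterator_to_array(first());

// preserve_keys = false: keys are dropped, every value survives.
$withoutKeys = iterator_to_array(first(), false);

// Manual build: keep every key/value pair, duplicates included.
$pairs = [];
foreach (first() as $key => $value) {
    $pairs[] = [$key, $value];
}

echo json_encode($withKeys), PHP_EOL;    // prints {"A":3,"B":4}
echo json_encode($withoutKeys), PHP_EOL; // prints [1,2,3,4]
echo json_encode($pairs), PHP_EOL;       // prints [["A",1],["B",2],["A",3],["B",4]]
```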
function functionA(): iterable
{
    throw new RuntimeException('SOME_ERROR');
    yield 'C' => 5;
}

function functionB(): \Generator
{
    try {
        return functionA();
    } catch (\Throwable $throwable) {
        // never reached: functionA() only creates the generator here
        echo $throwable->getMessage();
    }
    echo 'recovered from error';
    // do something
}

$result = functionB();
foreach ($result as $b) {
}
At a quick glance, calling functionB should never result in an exception, because everything is caught.
One will never know otherwise until an exception is thrown when iterating $result. Now is the time to preach the Russian proverb: trust, but verify.
The reason why calling functionB() does not throw an exception is that it returns a Generator, and execution of the generator function starts only when the generator is iterated. This pitfall can be avoided by calling the rewind method inside the try block, which executes the generator function up to the first yield.
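A minimal sketch of the rewind approach, using hypothetical names (failingRows, safeRows) and an empty array as an assumed fallback. Calling rewind() inside the try block forces the generator body to run up to its first yield, so an early exception surfaces where it can be caught. The generator can still be iterated with foreach afterwards, because it has not advanced past the first yield.

```php
<?php
declare(strict_types=1);

function failingRows(): \Generator
{
    throw new \RuntimeException('SOME_ERROR');
    yield 'C' => 5; // unreachable, but it makes this a generator function
}

function safeRows(): iterable
{
    try {
        $rows = failingRows(); // nothing runs yet, so nothing is thrown here
        $rows->rewind();       // runs the body up to the first yield: the exception surfaces now
        return $rows;
    } catch (\RuntimeException $exception) {
        // hypothetical fallback: report the problem and continue with no data
        echo 'recovered from error: ', $exception->getMessage(), PHP_EOL;
        return [];
    }
}

$items = [];
foreach (safeRows() as $key => $value) {
    $items[$key] = $value;
}

echo count($items), ' items', PHP_EOL; // prints 0 items
```

Note that this only catches errors raised before the first yield; anything thrown later in the generator body still surfaces during iteration.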