Friday, August 19, 2016

Micro benchmarking libraries for C++

Intro

The timer I’ve introduced recently is easy to use, but it returns only basic information: the elapsed time for an execution of some code. What if we need more advanced data and a more structured approach to benchmarking in the system?

My approach:
timer start = get_time();
// do something
// ...
report_elapsed(get_time() - start);
The above code lets you do some basic measurements to find potential hotspots in your application. For example, I’ve sometimes seen bugs like this (in a document editor app):
BUG_1234: Why is this file loading so long? Please test it and tune the core system that causes this!
To solve the problem you have to find which system is responsible for the unwanted delay. You might use a profiling tool or insert your timer macros here and there.

After the bug is fixed, you might leave such code (in a special profile build setup) and monitor the performance from time to time.

However, the above approach might not be enough in situations where performance is critical: in subsystems that really have to work fast. Monitoring them only from time to time might even give you misleading results. For those areas it might be better to implement a microbenchmarking solution.

Microbenchmarking

From wikipedia/benchmark

Component Benchmark / Microbenchmark (type):
  • Core routine consists of a relatively small and specific piece of code.
  • Measure performance of a computer’s basic components
  • May be used for automatic detection of computer’s hardware parameters like number of registers, cache size, memory latency, etc.
Additional answer from SO - What is microbenchmarking?

In other words, a microbenchmark is a benchmark of an isolated component, or just a single method; quite similar to unit tests. If you have a critical part of your system, you may want to create such microbenchmarks and execute them automatically. Every time there is a ‘bump’ in performance, you’ll know about it quickly.

I’ve seen that there is a debate on the internet (at least I’ve seen some good questions on SO related to this topic) about whether such microbenchmarking is really important and whether it gives valuable results. Nevertheless, it’s worth trying, or at least it’s good to know what options we have here.

BTW: here is a link to my question on reddit/cpp regarding micro benchmarking: Do you use microbenchmarks in your apps?

Since it’s a structured approach, there are ready-to-use tools that enable you to add such benchmarks quickly to your code.

I’ve tracked the following libraries:
  • Nonius
  • Hayai
  • Celero
  • Google Benchmark(*)
Unfortunately I couldn’t compile Google Benchmark on Windows, so my notes on it are quite limited. Hopefully this will change when the library fully works in my Windows/Visual Studio environment.

Test Code

Repo on my github: fenbf/benchmarkLibsTest

To make it simple, I just want to measure execution of the following code:

auto IntToStringConversionTest(int count)
{
    vector<int> inputNumbers(count);
    vector<string> outNumbers;

    iota(begin(inputNumbers), end(inputNumbers), 0);
    for (auto &num : inputNumbers)
        outNumbers.push_back(to_string(num));

    return outNumbers;
}
and the corresponding test for double:
auto DoubleToStringConversionTest(int count)
{
    vector<double> inputNumbers(count);
    vector<string> outNumbers;

    iota(begin(inputNumbers), end(inputNumbers), 0.12345);
    for (auto &num : inputNumbers)
        outNumbers.push_back(to_string(num));

    return outNumbers;
}
The code creates a vector of numbers (int or double), fills it with numbers from 0 up to count - 1 (with some offset for the double type), converts those numbers into strings, and returns the final vector.

BTW: you might wonder why I’ve put auto as the return type for those functions… just to test the new C++14 feature :) Although it looks quite odd; when you write the full return type, it’s clearer what the method returns and what it does…

Hayai library

Github repo: nickbruun/hayai, Introductory article by the author

The library was implemented around the time the author was working on a content distribution network. He often needed to find bottlenecks in the system, and profiling became a key activity. At some point, instead of just doing stopwatch benchmarking, he decided to go for something more advanced: a benchmarking framework where the team could test crucial parts of the server code in isolation.

Hayai (“fast” in Japanese) is heavily inspired by the Google Test framework. One advantage: it’s header-only, so you can quickly add it to your project.

Update: after contacting the author of the library, it appears this tool is more powerful than I thought! Its features aren’t documented, so we need to dig into the repo to find them :)

The simplest example:

#include <hayai.hpp>

BENCHMARK(MyCoreTests, CoreABCFunction, 10, 100)
{
    myCoreABCFunction();
}
  • first param: group name
  • second: test name
  • third: number of runs
  • fourth: number of iterations
In total, myCoreABCFunction will be called num_runs x num_iterations times. The time is measured for each run. So if your code is small and fast, you might increase the number of iterations to get more reliable results.

Or an example from my testing app:
#include "hayai.hpp"

BENCHMARK(ToString, IntConversion100, 10, 100)
{
    IntToStringConversionTest(TEST_NUM_COUNT100);
}

BENCHMARK(ToString, DoubleConversion100, 10, 100)
{
    DoubleToStringConversionTest(TEST_NUM_COUNT100);
}

int main(int argc, char** argv)
{
    // Set up the main runner.
    ::hayai::MainRunner runner;

    // Parse the arguments.
    int result = runner.ParseArgs(argc, argv);
    if (result)
        return result;

    // Execute based on the selected mode.
    return runner.Run();
}
When you run this, you’ll get output like the following:


As you can see we get average/min/max for runs and also for iterations.

In more advanced scenarios there is an option to use fixtures (with SetUp() and TearDown() virtual methods).

If we run the binary with the --help parameter, we get the following list of options:


In terms of output: the library can write to JSON, JUnit XML, or the normal console (correction: I initially thought it supported only the console). So it’s possible to take the data and analyse it in a separate tool.

Celero library

Github repository: DigitalInBlue/Celero, CodeProject article, Another CodeProject article with examples

Celero goes a bit further and introduces the concept of a baseline for the tested code. You first write your basic solution, then write other benchmarks that might improve (or lower) the performance of that baseline approach. This is especially useful when you want to compare several approaches to a given problem: Celero will compare all the versions against the baseline.

The library is implemented using modern C++11 features and it’s not header-only: you have to build the library first and link it to your project. Fortunately that’s very easy because there is a CMake project. It works with GCC, Clang, Visual Studio, and other modern C++ compilers.

Example from my testing app:
#include "celero/Celero.h"
#include "../commonTest.h"

CELERO_MAIN;

BASELINE(IntToStringTest, Baseline10, 10, 100)
{
    IntToStringConversionTest(TEST_NUM_COUNT10);
}

BENCHMARK(IntToStringTest, Baseline1000, 10, 100)
{
    IntToStringConversionTest(TEST_NUM_COUNT1000);
}

BASELINE(DoubleToStringTest, Baseline10, 10, 100)
{
    DoubleToStringConversionTest(TEST_NUM_COUNT10);
}

BENCHMARK(DoubleToStringTest, Baseline1000, 10, 100)
{
    DoubleToStringConversionTest(TEST_NUM_COUNT1000);
}
Similarly to the Hayai library, we can specify the group name, the test name, the number of samples (measurements) to take, and the number of operations (iterations) for which the code will be executed.

What’s nice is that when you pass 0 as the number of samples, Celero will figure out the proper number on its own.

The output:

Other powerful features:
  1. As in other solutions, there is an option to use fixtures in your tests.
  2. Celero gives you a helper, celero::DoNotOptimizeAway, that can be used to make sure the compiler won’t remove your code from the final binary.
  3. Celero can automatically run threaded benchmarks.
  4. There is an option to run a benchmark under a time limit (rather than an execution-count limit), so you can, for example, run your benchmark for one second.
  5. The library lets you define a problem space: for example, when you’re testing an algorithm you can provide several N values, and for each N a complete set of benchmarks will be executed. This might be useful for producing graphs of your results.
  6. You can output data to CSV or JUnit XML, and even archive old result files.
Nonius library

The main site - nonius.io, Github repo - rmartinho/nonius

Nonius (in fact, it’s the name of an astrolabe-like device) is a library that goes a bit beyond basic measurements and introduces some more statistics into our results.

One outcome of this idea is that you don’t have to pass the number of runs or iterations of your code: the library will figure it out (Celero implemented part of that idea; in Hayai there is no such option yet).

Nonius runs your benchmark in the following steps:
  • Taking an environmental probe: e.g. timer resolution. This doesn’t need to be executed for each benchmark.
  • Warm-up and estimation: your code is run several times to estimate how many times it should finally be executed.
  • Main code execution: the benchmark code is executed the number of times computed in the previous step, and samples are collected.
  • Magic happens: bootstrapping is run over the collected samples.
The library uses modern C++ and is header-only. I had no problem adding it to my sample project. Maybe there was one additional step: you need to have Boost installed somewhere, because the library depends on it. Nonius uses std::chrono internally, but if you cannot rely on it (for example, because you’re using VS2013, which has a bug in its std::chrono implementation) you can define NONIUS_USE_BOOST_CHRONO and it will use the Boost libraries instead.

Example from my testing app:
#define NONIUS_RUNNER
#include "nonius.h++"
#include "../commonTest.h"

NONIUS_BENCHMARK("IntToStringTest1000", [] 
{
    IntToStringConversionTest(TEST_NUM_COUNT1000);
})

NONIUS_BENCHMARK("DoubleToStringTest1000", [] 
{
    DoubleToStringConversionTest(TEST_NUM_COUNT1000);
})
We get the following output:


Here we have to read the output more carefully.

I’ve mentioned that after the data is collected bootstrapping is executed, so we get a bit more detailed results:
  1. there is a mean, an upper bound, and a lower bound of the samples
  2. standard deviation
  3. outliers: samples that are too far from the mean and that may disturb the final results
As you can see, you get very interesting data! If, for example, some unexpected job was running during the benchmark execution (a video player, power-saving mode, …), you should catch it, because the outliers will indicate that the results are probably invalid or heavily disturbed.

By specifying -r html -o results.html we can get a nice graph (as one HTML page):

Other features:
  • Fixtures can be used
  • if the benchmark consists of a single function call like myCompute(), you can just write return myCompute() and the library guarantees that the call won’t be optimized away.
  • there is a nonius::chronometer input parameter that can be used to perform more advanced tests.
  • there is a way to separate construction and destruction code from the actual benchmarked code: nonius::storage_for<T>
Google Benchmark library

Comparison:

Date of writing: 12 May 2016


Summary

In this article I went through three libraries that let you create and execute micro benchmarks. All of them are relatively easy to add to your project (especially Hayai and Nonius, which are header-only); to use Celero you just have to link to its lib.

Hayai seems to be the simplest solution of the three. It’s very easy to understand, yet you get a decent set of functionality: console, JUnit XML, or JSON output; benchmark order randomization; benchmark filtering.

Celero has lots of features, and I probably didn’t cover all of them in this short report. This library seems to be the most advanced one; it uses baselines for the benchmarks. Although it is very powerful, it’s relatively easy to use, and you can gradually adopt its more complex features.

Nonius is probably the nicest. It offers powerful statistical tools for analysing the samples, so it should give you the most accurate results. I was also impressed by the number of output formats: even an HTML graph.
Written by Bartlomiej Filipek

If you found this post interesting, follow and support us.
Suggested for you:

Learn to Program with C++

Object Oriented Programming with C++

C++ Programming from Zero to Hero:  The Fundamentals

Advanced C++ Programming Training Course

C++: From Beginner to Expert

Thursday, August 18, 2016

Custom Deleters for C++ Smart Pointers


Let’s say we have the following code:
LegacyList* pMyList = new LegacyList();
...
pMyList->ReleaseElements();
delete pMyList;
In order to fully delete the object we need to perform an additional action.

How to make it more C++11? How to use unique_ptr or shared_ptr here?

Intro

We all know that smart pointers are really nice and that we should use them instead of raw new and delete. But what if deleting the pointer is not the only thing we need to do before the object is fully destroyed? In our short example we have to call ReleaseElements() to completely clear the list.

Side Note: we could simply redesign LegacyList so that it properly clears its data inside its destructor. But for this exercise we need to assume that LegacyList cannot be changed (it’s some legacy, hard to fix code, or it might come from a third party library).

ReleaseElements is just my invention for this article. Other things might be involved instead: logging, closing a file, terminating a connection, returning an object to a C-style library… or, in general, any resource-releasing (RAII-style) procedure.

To give more context to my example, let’s discuss the following use of LegacyList:
class WordCache {
public:
    WordCache() { m_pList = nullptr; }
    ~WordCache() { ClearCache(); }

    void UpdateCache(LegacyList *pInputList) { 
        ClearCache();
        m_pList = pInputList;
        if (m_pList)
        {
            // do something with the list...
        }
    }

private:
    void ClearCache() { 
        if (m_pList) { 
            m_pList->ReleaseElements();
            delete m_pList; 
            m_pList = nullptr; 
        } 
    }

    LegacyList *m_pList; // owned by the object
};
You can play with the source code here: using Coliru online compiler.

This is a bit of old-style C++ class. The class owns the m_pList pointer, so it is initialized to nullptr in the constructor and released in the destructor. To make life easier, there is a ClearCache() method that is called from the destructor or from UpdateCache().

The main method UpdateCache() takes a pointer to a list and takes ownership of that pointer. The pointer is deleted in the destructor or when the cache is updated again.

Simplified usage:
WordCache myTestClass;

LegacyList* pList = new LegacyList();
// fill the list...
myTestClass.UpdateCache(pList);

LegacyList* pList2 = new LegacyList();
// fill the list again
// pList should be deleted, pList2 is now owned
myTestClass.UpdateCache(pList2);
With the above code there shouldn’t be any memory leaks, but we need to pay careful attention to what’s going on with the pList pointer. This is definitely not modern C++!

Let’s update the code so it’s modernized and properly uses RAII (smart pointers, in this case). Using unique_ptr or shared_ptr seems easy, but there is a slight complication: how do we execute the additional code that is required to fully delete a LegacyList?

What we need is a Custom Deleter.

Custom Deleter for shared_ptr

I’ll start with shared_ptr because this type of pointer is more flexible and easier to use.

What should you do to pass a custom deleter? Just pass it when you create a pointer:
std::shared_ptr<int> pIntPtr(new int(10), 
    [](int *pi) { delete pi; }); // deleter 
The above code is quite trivial and mostly redundant; in fact, it’s more or less the default deleter, because it just calls delete on the pointer. But the point is that you can pass any callable (lambda, functor, function pointer) as the deleter when constructing a shared pointer.

In the case of LegacyList let’s create a function:
void DeleteLegacyList(LegacyList* p) {
    p->ReleaseElements(); 
    delete p;
}
The modernized class is super simple now:
class ModernSharedWordCache {
public:
    void UpdateCache(std::shared_ptr<LegacyList> pInputList) { 
        m_pList = pInputList;
        // do something with the list...
    }
private:
    std::shared_ptr<LegacyList> m_pList;
};
  • No need for constructor - the pointer is initialized to nullptr by default
  • No need for destructor - pointer is cleared automatically
  • No need for helper ClearCache - just reset pointer and all the memory and resources are properly cleared.
When creating the pointer we need to pass that function:
ModernSharedWordCache mySharedClass;
std::shared_ptr<LegacyList> ptr(new LegacyList(),
                                DeleteLegacyList);
mySharedClass.UpdateCache(ptr);
As you can see, there is no need to take care of the pointer: just create it (remembering to pass a proper deleter) and that’s all.

Where is the custom deleter stored?

When you use a custom deleter, it doesn’t affect the size of your shared_ptr type. If you remember, that should be roughly 2 x sizeof(ptr) (8 or 16 bytes)… so where does the deleter hide?

shared_ptr consists of two things: a pointer to the object and a pointer to the control block (which contains, for example, the reference counter). The control block is created only once per owned pointer, so two shared_ptrs to the same object point to the same control block.

Inside the control block there is space for the custom deleter and the allocator.

Can I use make_shared?

Unfortunately, you can pass a custom deleter only in the constructor of shared_ptr, so there is no way to use make_shared. This might be a bit of a disadvantage because, as I described in “Why create shared_ptr with make_shared?” on my old blog, make_shared allocates the object and its control block next to each other in memory. Without make_shared you get two, probably separate, blocks of allocated memory.

Update: I got a very good comment on reddit from quicknir, saying that I am wrong on this point and that there is something you can use instead of make_shared.

Indeed, you can use allocate_shared and keep both the custom deleter and the single memory block. However, that requires writing a custom allocator, so I considered it too advanced for the original article.

Custom Deleter for unique_ptr

With unique_ptr there is a bit more complication. The main thing is that the deleter type is part of the unique_ptr type.

By default we get std::default_delete:
template <
    class T,
    class Deleter = std::default_delete<T>
> class unique_ptr;
The deleter is part of the pointer type, so a heavy deleter (in terms of member data) means a larger pointer type.

What to choose as a deleter?

What is best to use as a deleter? Let’s consider the following options:
  1. std::function
  2. Function pointer
  3. Stateless functor
  4. Stateful functor
  5. Lambda
What is the smallest size of unique_ptr with the above deleter types? Can you guess? (Answer at the end of the article)

How to use?

For our example problem let’s use a functor:
struct LegacyListDeleterFunctor {  
    void operator()(LegacyList* p) {
        p->ReleaseElements(); 
        delete p;
    }
};
And here is a usage in the updated class:
class ModernWordCache {
public:
    using unique_legacylist_ptr = 
       std::unique_ptr<LegacyList,  
          LegacyListDeleterFunctor>;

public:
    void UpdateCache(unique_legacylist_ptr pInputList) { 
        m_pList = std::move(pInputList);
        // do something with the list...
    }

private:
    unique_legacylist_ptr m_pList;
};
The code is a bit more complex than the shared_ptr version: we need to define a proper pointer type. Below I show how to use the new class:
ModernWordCache myModernClass;
ModernWordCache::unique_legacylist_ptr pUniqueList(new LegacyList());
myModernClass.UpdateCache(std::move(pUniqueList));
All we have to remember, since it’s a unique pointer, is to move the pointer rather than copy it.

Can I use make_unique?

Similarly to shared_ptr, you can pass a custom deleter only in the constructor of unique_ptr, and thus you cannot use make_unique. Fortunately, I thought, make_unique is only for convenience (wrong!) and doesn’t give any performance/memory benefits over normal construction.

Update: I was too confident about make_unique :) There is always a purpose for such functions. Look at GotW #89 Solution: Smart Pointers, guru question 3:

make_unique is important because:
First of all:
Guideline: Use make_unique to create an object that isn’t shared (at
least not yet), unless you need a custom deleter or are adopting a raw
pointer from elsewhere.
Secondly:
make_unique gives exception safety: Exception safety and make_unique

So, by using a custom deleter we lose a bit of safety. It’s worth knowing the risk behind that choice. Still, a custom deleter with unique_ptr is far better than playing with raw pointers.

Summary

In this post I’ve shown you how to use custom deleters with the C++ smart pointers shared_ptr and unique_ptr. These deleters can be used in all the places where a ‘normal’ delete ptr is not enough: when you wrap FILE*, or some kind of C-style structure (SDL_FreeSurface, free(), destroy_bitmap from the Allegro library, etc.).
Remember that proper cleanup is not only about memory destruction; often some other actions need to be invoked. With custom deleters you have that option.

Gist with the code is located here: fenbf/smart_ptr_deleters.cpp
Written by Bartlomiej Filipek



Saturday, August 13, 2016

To Be or Not to Be (an Iterator)


Way back in 1999, when the ink on the first C++ standard was still damp, Herb Sutter posed a GoTW puzzler in the then still extant C++ Report (RIP): When Is a Container Not a Container? In that article, Herb described the problems of the now-infamous vector<bool>. According to the standard’s own container requirements, vector<bool> is not a container.

In a nutshell, it’s because vector<bool>‘s iterators claim to be random-access, but they’re not. Random-access iterators, when you dereference them, must return a real reference. They can only do that if the thing they point to really exists somewhere. But the bool that a vector<bool>::iterator points to does not exist anywhere. It’s actually a bit in a packed integer, and dereferencing a vector<bool>‘s iterator returns an object of some type that merely acts like a bool& without actually being a bool&.

Herb goes so far as to say this:
[…] although a proxied collection can be an important and useful tool, by definition it must violate the standard’s container requirements and therefore can never be a conforming container.
At the end of his article, Herb suggests that people stop using vector<bool> and use std::bitset if they want bit-packing. But that just pushes the problem around. Why shouldn’t std::bitset be a conforming container with random-access iterators? If proxied collections are so useful, why should we content ourselves with a standard library that treats them like second-class citizens?

A Brief History of Proxy Iterators

Herb wrote his article in 1999, so we’ve been living with this problem for a long time. Many have tried to fix it and ultimately failed for one reason or another. Mostly it’s because all the solutions have tried to be backwards compatible, shoehorning a richer iterator hierarchy into a standard that doesn’t easily allow it, or else breaking iterators themselves apart into separate objects that control traversal and element access. Each time the committee has balked, preferring instead the devil it knew.

An interesting historical note: the original STL design didn’t have the “true reference” requirement that causes the problem. Take a look at the SGI docs for the Forward Iterator concept. Nowhere does it say that *it should be a real reference. The docs for Trivial Iterators specifically mention proxy references and say they’re legit.

Recently, a who’s who of C++ luminaries put their names on N3351, the so-called Palo Alto TR, which proposes a concept-based redesign of the STL, using the syntax of Concepts Lite. Interestingly, the Palo Alto TR is a throw-back to the original SGI design: there is no “true-reference” requirement on the return type of *it; it merely must be convertible to const ValueType<I> &:
// This must work, according to the Palo Alto TR
const ValueType<I> & val = *it;
It’s not hard for a proxy reference type to provide such a conversion. For instance, the following compiles today:
std::vector<bool> vb{true, false, true, false};
auto it = vb.begin();
const bool & val = *it;
*it has an implicit conversion to bool, which binds to a const bool&. Awesome! So the problem is solved, right? Not quite.

A Panoply of Proxy Problems

To better see the problems with proxy iterators, let’s look at a more interesting example: a zip view. When you zip two sequences together, you get a single sequence where each element is a std::pair of elements from the two source sequences. This can be done lazily, creating pairs on demand as the zip view is iterated:
std::vector<int> v1 { 1,2,3 };
std::vector<int> v2 { 9,8,7 };

auto z = view::zip( v1, v2 );
auto it = z.begin();

assert( *it   == std::make_pair(1,9) );
assert( *++it == std::make_pair(2,8) );
assert( *++it == std::make_pair(3,7) );
Since the zip view is generating the pairs on demand, they don’t exist anywhere in memory. But the elements they refer to do! See?
std::pair<int&,int&> p = *z.begin();
assert( &p.first  == &v1[0] );
assert( &p.second == &v2[0] );
The zip view is a very interesting beastie. Its reference type is pair<T&,U&> and its value type is pair<T,U>. This poses some very interesting challenges for the iterator concepts.

1. Values and References

Recall that the Palo Alto TR requires *it to be convertible to const ValueType<I>&. So we should be able to do this:
auto z = view::zip( v1, v2 );
const pair<int,int>& val = *z.begin();
That works! As it so happens, there is a conversion from std::pair<T&,U&> to std::pair<T,U> — but there’s a catch: it only works if T and U are copyable! And even when they’re not, it’s clear that copying is not the behavior one would expect when using *it to initialize a const reference. If T or U is expensive to copy, you’re not going to get the performance or the behavior you expect, and if it’s unique_ptr it’s not going to compile at all. 🙁

Requiring that an iterator’s reference type be convertible to const ValueType<I>& is over-constraining. But then what useful thing can we say about the relationship between these two types?

2. Algorithm Constraints

All the algorithm signatures in the Palo Alto TR use ValueType in the concept checks in order to constrain the templates. For example, here’s the constrained signature of for_each:
  1. template<InputIterator I, Semiregular F>
  2.     requires Function<F, ValueType<I>>
  3. F for_each(I first, I last, F f);
If you’re not familiar with C++ concepts, what lines 1 and 2 say is: first and last must satisfy the requirements of the InputIterator concept, F must be Semiregular (I’ll gloss over this bit), and it must be callable with one argument of the iterator’s value type.

Now imagine code like this:
// As before, v1 and v2 are vectors of ints:
auto z = view::zip( v1, v2 );
// Let Ref be the zip iterator's reference type:
using Ref = decltype(*z.begin());
// Use for_each to increment all the ints:
for_each( z.begin(), z.end(), [](Ref r) {
    ++r.first;
    ++r.second;
});
This seems perfectly reasonable. The lambda accepts an object of the zip view’s reference type, which is a pair<int&,int&>, and then it increments both first and second members. But this doesn’t type-check. Why?

Remember the concept check: Function<F, ValueType<I>>. The function we pass to for_each must be callable with an object of the iterator’s value type. In this case, the value type is pair<int,int>. There is no conversion from that to the type the function expects, which is pair<int&,int&>. Bummer.

If we change the lambda to take a pair<int,int>&, then the concept check passes, but the template will fail to instantiate correctly. It’s easy to see why when you look at a typical for_each implementation:
template<InputIterator I, Semiregular F>
requires Function<F, ValueType<I>>
F for_each(I first, I last, F f) {
    for(; first != last; ++first)
        f(*first);
    return f;
}
The lambda is called with *first which has type pair<int&,int&>, but that doesn’t convert to pair<int,int>&. Gah!!!

The most galling bit is that the code we wrote above — the code with the lambda that takes the reference type — works just fine if we simply delete the requires Function<F, ValueType<I>> constraint. Clearly something is wrong with the constraints, the concepts, or our expectations.

I should add that the problem is not specific to the zip view. Any sequence with a proxy reference type has this problem, vector<bool> included. If we just slap these constraints on the existing algorithms, some code that works today will break, and the only “fix” would be to stop using the standard algorithms. 🙁

3. Permutability of Move-Only Types

Unfortunately, the problems don’t end there. The sort algorithm requires a sequence to be permutable; that is, you should be able to shuffle its elements around. And since it should support move-only types, that means that the sequence’s iterators should be indirectly-movable. The Palo Alto TR has this to say about it:
The IndirectlyMovable and IndirectlyCopyable concepts describe copy and move relationships between the values of an input iterator, I, and an output iterator Out. For an output iterator out and an input iterator in, their syntactic requirements expand to:
  • IndirectlyMovable requires *out = move(*in)
But what if *in returns a proxy? Then move(*in) is moving the proxy, not the object to which the proxy refers. In the case of sorting a zip view, we’re trying to move a (temporary) pair<T&,U&> into a pair<T&,U&>. As with issue (1), that won’t work at all for move-only types. But you would probably fail before that, at the sort requires clause, because of issue (2). Sheesh!

Summary, For Now…

Even though the Palo Alto TR lifts the over-constraining requirement that ForwardIterators return real references, the proxy iterator problem remains. On the one hand, it says that proxy iterators are OK. On the other hand, some interesting proxy iterators fail to model the Iterator concept or satisfy the algorithm constraints, and those that do don’t have the right semantics or performance characteristics. What are our options?
  1. The zip view, vector<bool>, and their ilk are useful, but they are not legitimate containers and ranges, and the STL can’t support them, full stop; or
  2. The iterator concepts (and probably the algorithm constraints) as specified in the Palo Alto TR need to be tweaked somehow to support proxy iterators, and some algorithm implementations probably need to change, too; or
  3. The language needs to change to better support proxy references (an idea from Sean Parent); or
  4. Something else.
I really don’t like option (1); there are too many interesting forward iterators that can’t return true references, and I’m tired of doing without. I have some rudimentary ideas about option (2) that I plan to describe in my next post. Option (3) can’t be ruled out, but IANALL (I Am Not A Language Lawyer) and have no idea what would be involved. It’s clear that with C++17 shaping up, the Concepts Lite TR finally reaching PDTS status, and a range-ified, concept-ified STL in the works, the time to start making decisions about this stuff is now.
Written by Eric Niebler


Friday, August 12, 2016

A Slice of Python in C++


This post describes a fun piece of hackery that went into my Range-v3 library recently: a Python-like range slicing facility with cute, short syntax. It’s nothing earth-shattering from a functionality point of view, but it’s a fun little case study in library design, and it nicely illustrates my philosophy of library design.

Python Slicing

In Python, you can slice a container — that is, create a view of a contiguous subrange — using a very concise syntax. For instance:
  1. >>> letters = ['a', 'b', 'c', 'd', 'e', 'f', 'g']
  2. >>> letters
  3. ['a', 'b', 'c', 'd', 'e', 'f', 'g']
  4. >>> # access a subrange with a slice operation
  5. >>> letters[2:5]
  6. ['c', 'd', 'e']
  7. >>> # replace some values
  8. >>> letters[2:5] = ['C', 'D', 'E']
  9. >>> letters
  10. ['a', 'b', 'C', 'D', 'E', 'f', 'g']
On line 5, we access elements of the list letters in the half-open sequence [2,5) using the syntax letters[2:5]. Short and sweet. On line 8, we assign through the slice, which mutates the underlying letters list. That proves that Python slices have reference semantics.

That’s not all that the Python slice operator can do. You can leave off slice offsets, in which case Python takes a smart default:
  1. >>> # A missing first offset means "from the beginning"
  2. >>> letters[:5]
  3. ['a','b','C', 'D', 'E']
  4. >>> # A missing end offset means "to the end"
  5. >>> letters[5:]
  6. ['f','g']
You can even slice from the end with negative offsets:
  1. >>> # Take the last two elements:
  2. >>> letters[-2:]
This is all pretty handy and really cool.

Old-Style Slicing in C++ with Range-v3

My range-v3 library has had a slice operation for a long time now, but it wasn’t as powerful and the syntax wasn’t as cool:
  1. using namespace ranges;
  2. auto letters = view::iota('a','h');
  3. std::cout << letters << '\n';
  4. // prints: {a,b,c,d,e,f,g}
  5. std::cout << (letters | view::slice(2,5)) << '\n';
  6. // prints: {c,d,e}
In the above code, view::iota is a view that generates all the characters from 'a' to 'g' (inclusive), and view::slice is a view of the elements from offset 2 up to (but not including) offset 5. As with Python’s slice, this slice is lightweight and non-owning.

This syntax is not terrible per se, but it’s certainly not as fun as Python’s. And view::slice didn’t accept negative offsets to slice from the end, so it wasn’t as powerful, either.

New-Style Slicing in C++ with Range-v3

First, I wanted to find a nice short-form for creating slices, so I took a page from the array_view proposal, which has a really, really clever syntax for indexing into a multi-dimensional array. Here’s an example lifted straight from the proposal:
  1. char a[3][1][4] {{{'H', 'i'}}};
  2. auto av = array_view<char, 3>{a};
  3. // the following assertions hold:
  4. assert((av.bounds() == bounds<3>{3, 1, 4}));
  5. assert((av[{0, 0, 0}] == 'H'));
Lines 1-2 declare a 3-D array of characters and then create a 3-D view of it. Line 5 is where the magic happens. It accesses the element at the (0,0,0) position with the slightly alien-looking av[{0,0,0}] syntax. What the heck is this?!

It’s really very simple: a novel use of uniform initialization syntax. Consider this type:
  1. struct indices
  2. {
  3.     std::size_t i, j, k;
  4. };
  5. struct my_array_view
  6. {
  7.     double & operator[](indices x);
  8. };
Now I can index into a my_array_view object with the av[{0,0,0}] syntax. Neat-o!

I realized I could use this trick to give people a super-short and cute syntax for slicing ranges. Check it out:
  1. using namespace ranges;
  2. auto letters = view::iota('a','h');
  3. std::cout << letters << '\n';
  4. // prints: {a,b,c,d,e,f,g}
  5. std::cout << letters[{2,5}] << '\n';
  6. // prints: {c,d,e}
Hey, that’s not half bad!

Slicing From the End, A Dilemma

But that’s not enough. I want the handy slice-from-the-end functionality. But here’s where things get a bit … interesting … from a library design perspective. Not all range types support slicing from the end. To see what I mean, consider a range of ints read from an istream. This is an input range. You don’t know the end until you reach it, which means that you don’t know the last-minus-N element until you’re N elements past it!

In other words, the following code makes no sense:
  1. using namespace ranges;
  2. // An input range of ints read from cin
  3. auto ints = istream<int>(std::cin);
  4. // I'm sorry, I can't do that, Dave:
  5. std::cout << ints[{0,-2}] << '\n';
The istream range returned by istream totally knows at compile time that it can’t be sliced from the end. But whether the offsets are negative or positive is a runtime property, so it can’t be checked at compile time. That would make this a runtime failure. Ugh.

To make matters worse, the rules about what categories of ranges accept negative offsets are surprisingly subtle. Consider this variation of the code above:
  1. using namespace ranges;
  2. // Take the first 10 ints read from cin:
  3. auto ints = istream<int>(std::cin) | view::take(10);
  4. // This should work! It should take the first 8 ints:
  5. std::cout << ints[{0,-2}] << '\n';
In this case, we’ve taken the first 10 integers from an istream. The ints range is still an input range, but it’s a sized input range. Now we can slice from the end because we know where the end is.

And if we have a forward range, we can always slice from the end, even if we don’t know where that is (e.g. a null-terminated string), by computing the length of the sequence and then advancing distance-minus-N from the front (although that’s not always the most efficient way to do it).

And you should never specify a negative offset if the range is infinite. Never, ever, ever.

It gets even more subtle still: if both offsets are negative, or if both offsets are non-negative, then the resulting slice knows its size in O(1); otherwise, it only knows its size if the underlying range knows its size. When the O(1)-sized-ness of a range is part of the type system, it enables all sorts of optimizations. If we don’t know the sign of the offsets until runtime, we can’t ever return a type that advertises itself as sized.

My point is that the rules for when it’s OK to slice from the end are subtle — far too subtle to leave the error reporting until runtime. And doing so leaves valuable optimizations on the floor.

Slicing From the End, A Solution

The solution I came up with was to disallow negative offsets with an unconditional assert. But wait before you flame me! I added an alternate syntax for denoting an offset from the end. Check it out:
  1. using namespace ranges;
  2. auto letters = view::iota('a','h');
  3. std::cout << letters << '\n';
  4. // prints: {a,b,c,d,e,f,g}
  5. std::cout << letters[{2,end-2}] << '\n';
  6. // prints: {c,d,e}
Instead of using a negative offset, we say end-2 to mean the 2nd from the end. What is end here? It’s the same end function that you call to get the end of an Iterable (think std::end), only in my library it’s not a function; it’s a function object. (For more about why I chose to make begin and end global function objects instead of free functions, see my blog post about Customization Point Design.) Since end is an object, I can define an overloaded operator- that takes end on the left-hand-side and an int on the right. That can return an object of some type that makes the from-the-end-ness of the offset a part of the type system.
  1. struct from_end { int i; };
  2.  
  3. from_end operator-( decltype(ranges::end), int i )
  4. {
  5.     assert(i >= 0); // No funny business, please
  6.     return {i};
  7. }
Now I can define an overloaded operator[] on my range type that accepts a std::pair<int,from_end>:
  1. struct my_range
  2. {
  3.     // callable as rng[{2,end-2}]
  4.     slice_view<my_range>
  5.     operator[](std::pair<int, from_end> p)
  6.     {
  7.         // ... slicing happens here
  8.     }
  9. };
Voilà! Now I get slicing from the end with a short, readable syntax and compile-time type checking without leaving any optimization opportunities on the floor.

Yes, But…

That’s great and all, but code like “rng[{2,-2}]” still compiles and fails at runtime. How is the situation any better? The difference now is that passing a negative offset to slice is always a runtime error. There is no situation in which it will succeed and do what you want, even if the range type could conceivably support it. Users will quickly learn that that isn’t the way to do it.

Had we allowed negative offsets in a way that sometimes worked and sometimes didn’t, the interface would be far more dangerous. Users would try it, meet with some success, and incorrectly conclude that it always works. They would discover their error the hard way, after their application had been deployed.

Which brings me to my Philosophy of Library Design:
I can’t keep people from writing bad code. But I’m guilty of collusion if I make it easy.
And a corollary that relates to this problem:
If you can’t make something succeed consistently, it’s better to make it fail consistently.
Hope you enjoyed this little case study in library design.
Written by Eric Niebler
