Self-Registration Factory Pattern in C++

Usage Scenarios

Implementing a typical factory pattern in C++ is not complicated. Take the common Shape class as an example:

// ShapeFactory.cpp
#include <memory>
#include <string_view>

#include "Shape.h"     // class Shape
#include "Triangle.h"  // class Triangle: public Shape
#include "Circle.h"    // class Circle: public Shape

// Maps a shape name to a newly created instance; returns nullptr for unknown names.
std::unique_ptr<Shape> createShape(std::string_view name) {
    if (name == "triangle") return std::make_unique<Triangle>();
    if (name == "circle") return std::make_unique<Circle>();
    return nullptr;
}

This approach is straightforward and widely used, but it has two disadvantages:

  1. Every concrete class must be registered manually in ShapeFactory.cpp. Over time this file keeps growing and accumulates too many if-else branches;
  2. It is hard to isolate optional features with macros. Once platform macros are added, the preprocessor directives become nested and hard to read.

The second problem can be solved by generating the list file dynamically, as FFmpeg does with codec_list.c. As a cross-platform library, FFmpeg faces the same problem with an even more complex set of feature options. Its solution is to generate this list file at configure time, so that only the enabled codecs appear in the list, which keeps it relatively clean.

This article introduces another way to solve both problems: the self-registering factory pattern.

Implementation and Principles

Self-registration exploits the automatic initialization of global static variables or static class members. The initializer (or constructor) of such a variable implicitly registers the class with the factory. In C, a similar effect can be achieved with the GCC/Clang extension __attribute__((constructor)).
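As a rough illustration of the two mechanisms just mentioned, the sketch below registers one name through __attribute__((constructor)) and another through the initializer of a namespace-scope variable. The registry function and all names in it (registeredNames, registerCircle, triangleRegistered, isRegistered) are hypothetical and only serve this example; the attribute is a GCC/Clang extension and is not available on every compiler.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical registry. It is a function-local static, so it is constructed
// on first use, no matter how early a registration function runs.
std::vector<std::string>& registeredNames() {
    static std::vector<std::string> names;
    return names;
}

// GCC/Clang extension: this function is run at load time, before main().
static void registerCircle() __attribute__((constructor));
static void registerCircle() {
    registeredNames().push_back("circle");
}

// Standard C++ counterpart: the initializer of a namespace-scope variable
// performs the registration as a side effect.
[[maybe_unused]] static const bool triangleRegistered = [] {
    registeredNames().push_back("triangle");
    return true;
}();

// Small helper to check what has been registered so far.
bool isRegistered(const std::string& name) {
    for (const auto& n : registeredNames()) {
        if (n == name) return true;
    }
    return false;
}

// Usage sketch: call this from main() to confirm both registrations ran.
void checkRegistrations() {
    assert(isRegistered("circle"));
    assert(isRegistered("triangle"));
}
```

Neither registration is called explicitly anywhere; both happen as a side effect of program start-up, which is the whole idea behind the pattern.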

Applied to the Shape example above, the registration code can be written as:

registry.h

template <typename ShapeType>
bool registerShape(std::string_view name) {
    // Store a creator for ShapeType under the given name.
    ShapeFactory::instance().registerShape(name, []() {
        return std::unique_ptr<Shape>(new ShapeType());
    });
    return true;
}

circle.cpp

class Circle: public Shape {
public:
    Circle() = default;
    double area() const override;

private:
    double m_radius = 0;
    static bool m_registered;
};

// Initializing this static member runs the registration as a side effect.
bool Circle::m_registered = registerShape<Circle>("Circle");

double Circle::area() const {
    return 3.14 * m_radius * m_radius;
}

Circle::m_registered exists only to receive the return value of the registration function and thereby ensure that the function is called. In practice the call happens during static initialization, before main is entered. Once inside main, you can use the factory class directly to create an instance.

main.cpp

#include <iostream>

int main() {
    auto shape = ShapeFactory::instance().createShape("Circle");
    if (shape)
        std::cout << "shape created" << std::endl;
    else
        std::cout << "shape not found" << std::endl;
    return 0;
}
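The snippets above use a ShapeFactory class that is not shown. A minimal single-file sketch of what it might look like follows; the Creator alias, the m_creators map, and the radius of 1.0 are assumptions made for this illustration and are not taken from the original repository.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <memory>
#include <string>
#include <string_view>
#include <utility>

class Shape {
public:
    virtual ~Shape() = default;
    virtual double area() const = 0;
};

// Hypothetical factory: maps a name to a function that creates a Shape.
class ShapeFactory {
public:
    using Creator = std::function<std::unique_ptr<Shape>()>;

    // Function-local static: constructed on first use, so registrations that
    // run during static initialization always find a live factory.
    static ShapeFactory& instance() {
        static ShapeFactory factory;
        return factory;
    }

    void registerShape(std::string_view name, Creator creator) {
        m_creators[std::string(name)] = std::move(creator);
    }

    // Returns nullptr when no creator is registered under the name.
    std::unique_ptr<Shape> createShape(std::string_view name) const {
        auto it = m_creators.find(std::string(name));
        if (it == m_creators.end()) return nullptr;
        return it->second();
    }

private:
    ShapeFactory() = default;
    std::map<std::string, Creator> m_creators;
};

template <typename ShapeType>
bool registerShape(std::string_view name) {
    ShapeFactory::instance().registerShape(name, []() {
        return std::unique_ptr<Shape>(new ShapeType());
    });
    return true;
}

class Circle : public Shape {
public:
    double area() const override { return 3.14 * m_radius * m_radius; }

private:
    double m_radius = 1.0;  // non-zero only so that area() is observable here
    static bool m_registered;
};

// Runs during static initialization and registers Circle with the factory.
bool Circle::m_registered = registerShape<Circle>("Circle");

// Usage sketch: the factory already knows "Circle" by the time this is called.
void checkFactory() {
    auto shape = ShapeFactory::instance().createShape("Circle");
    assert(shape != nullptr);
    assert(ShapeFactory::instance().createShape("Square") == nullptr);
}
```

Keeping the map inside the singleton, and the singleton inside a function, avoids any dependency on the order in which translation units are initialized.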

The advantage is obvious: the creator is registered inside the Circle source file itself. If we decide to disable the Circle class through a feature option, we only need to exclude this file in the build script. No code elsewhere has to change, and no feature macros are needed to isolate the code. We don't even need to declare Circle in a header file, since nothing else refers to this class directly!

This method is also well suited to loading plug-ins dynamically at runtime. Suppose the Circle class is designed as a plug-in: the host program only needs to call dlopen to load Circle.so, and the class registers itself into the list without any extra function for reading plug-in information.

The sample code above can be found in different branches here:

TsaiHao/SelfRegisterFactory

However, this approach has some inherent problems that are hard to ignore. The following sections describe its shortcomings and some workarounds.

Problems in Practice

Symbol Stripping of Static Linking

This is the most direct problem with self-registration, and it is the price of the decoupling achieved above. For a static library, the linker only copies the object files that the program actually references during the linking phase. The Circle.o object file is therefore dropped silently, because nothing outside Circle.cpp refers to the Circle class!

When we have to use static linking without giving up the self-registration method, we can:

  1. Use a link option that forces the link target to depend on the self-registered symbol. In clang / gcc this option is -u, see Clang command line argument reference - Clang 17.0.0git documentation. In the example above, add an INTERFACE link option to the shape target in CMake:

    target_link_options(shape INTERFACE -u__ZN6Circle12m_registeredE)

    After that, any target that links the libshape.a library depends on Circle::m_registered by default, which forces the linker to pull in the Circle.o object file. Note that the mangled name shown here carries the extra leading underscore used on macOS; on Linux the same symbol would typically be _ZN6Circle12m_registeredE.

  2. Compile the self-registering source file directly into the consuming target. That is, do not archive Circle.o into libshape.a, but attach the source file to the library as an INTERFACE source so that it is built and linked as part of every consumer. In CMake:

    target_sources(shape INTERFACE ${CMAKE_CURRENT_LIST_DIR}/shape/impl/Circle.cpp)

    This way, every target that links the shape library compiles Circle.cpp as one of its own source files, so the linker no longer discards Circle.o.

Both methods depend heavily on the toolchain and on CMake. A fully cross-platform solution needs a more general method, which I have not found yet.

The Construction Order of Static Variables

Because registration happens in the initializers of global variables, you must not rely on their initialization order, and any use of the factory should be deferred until after main has been entered wherever possible. For example:

  1. The map that holds the creators in the factory must be a static variable in function scope rather than at global scope;
  2. Do not create or use this factory while constructing other static variables (apart from the registration itself).

Who Uses This

Clang's plugins and the various modules of clang-tidy in the LLVM project use this technique:

  1. Clang plugins, see Clang Plugins - Clang 17.0.0git documentation. They are loaded dynamically with dlopen at runtime; once loaded, the actions in the plugin are registered into the list automatically, and symbol stripping is not an issue.

  2. Clang-tidy module registration. A new module is registered by initializing a static variable. To solve the symbol-stripping problem described above, clang-tidy declares an extern volatile int variable for each module in a public header file, while the definition of each variable lives in the corresponding module's source file. Because this int variable is referenced, the corresponding module's object file is not dropped by the linker.
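The anchor-variable trick can be sketched as follows. The variable names are hypothetical and merely modeled on the description above; in a real project the two halves would live in different files (the module's source file inside the static library, and a public header included by the executable), and they are shown in one file here only so that the sketch compiles on its own.

```cpp
#include <cassert>

// --- would live in CircleModule.cpp, inside the static library ---
// The module's self-registering static variable would sit next to this
// definition, in the same object file.
volatile int CircleModuleAnchorSource = 0;

// --- would live in a public header included by the executable ---
extern volatile int CircleModuleAnchorSource;

// Reading the anchor creates a reference from the executable to a symbol
// defined in CircleModule.o, so the linker has to pull that object file out
// of the archive, and the registration code comes along with it.
[[maybe_unused]] static int CircleModuleAnchorDestination = CircleModuleAnchorSource;

// Usage sketch: the destination simply mirrors the anchor's initial value.
void checkAnchor() {
    assert(CircleModuleAnchorDestination == 0);
}
```

The volatile qualifier keeps the compiler from optimizing the read away, which would otherwise remove the very reference the trick depends on.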

Introduction to std::any in C++

What Is std::any

std::any is a new feature introduced in the C++17 standard. It is a kind of type-safe void*, supporting copy, move, store, and get operations. Basic usage is documented here: std::any - cppreference

Understanding how std::any is implemented is rewarding, because it takes advantage of many C++ techniques, especially templates.

Implementation of std::any

The source code referred to below is from LLVM's libcxx library, revision: any - libcxx.

Class Layout

std::any has two data members:

  1. __h_: _HandleFuncPtr, a function pointer to a static function that serves as the single entry point for data manipulation, including constructing, destroying, copying, and accessing the value. The prototype of the function is:

    /*******
    * arg1: Enum, kind of the manipulation (_Destroy, _Copy, _Move, _Get, _TypeInfo)
    * arg2: Caller's "this" pointer
    * arg3: Destination any, used in copy, move
    * arg4: Runtime type info, always nullptr if RTTI is disabled
    * arg5: Fallback type info, described in a later section
    *******/
    using _HandleFuncPtr = void* (*)(_Action, any const *, any *, const type_info *,
                                     const void* __fallback_info);

  2. __s_: _Storage, which stores a pointer to the managed data. It is declared as a union so that large and small objects can be handled separately.

In short, std::any is essentially an aggregate of a data block and a predefined manipulation function, which to some extent illustrates the famous saying "Algorithms + Data Structures = Programs".

Implementation Techniques
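To make the "handler pointer plus storage" layout more concrete, here is a heavily simplified sketch. TinyAny and everything inside it are hypothetical names, not libcxx code: the sketch always stores the value on the heap, has no move support, and checks the type by comparing handler addresses, whereas libcxx checks a type id as described later in this article.

```cpp
#include <cassert>
#include <string>
#include <utility>

class TinyAny {
    enum class Action { Destroy, Copy, Get };
    using Handler = void* (*)(Action, const TinyAny* self, TinyAny* dest);

    Handler m_handler = nullptr;  // plays the role of __h_
    void* m_ptr = nullptr;        // plays the role of __s_.__ptr

    // One instantiation per stored type: this is where the erased type is
    // "remembered" and where all type-specific work happens.
    template <class T>
    static void* handle(Action action, const TinyAny* self, TinyAny* dest) {
        switch (action) {
        case Action::Destroy:
            delete static_cast<T*>(self->m_ptr);
            return nullptr;
        case Action::Copy:
            dest->m_ptr = new T(*static_cast<const T*>(self->m_ptr));
            dest->m_handler = &handle<T>;
            return nullptr;
        case Action::Get:
            return self->m_ptr;
        }
        return nullptr;
    }

public:
    TinyAny() = default;

    template <class T>
    static TinyAny make(T value) {
        TinyAny result;
        result.m_ptr = new T(std::move(value));
        result.m_handler = &handle<T>;
        return result;
    }

    // Copying dispatches through the handler, so T's copy constructor runs.
    TinyAny(const TinyAny& other) {
        if (other.m_handler) other.m_handler(Action::Copy, &other, this);
    }
    TinyAny& operator=(const TinyAny&) = delete;  // left out for brevity

    ~TinyAny() {
        if (m_handler) m_handler(Action::Destroy, this, nullptr);
    }

    // Returns nullptr on a type mismatch. Comparing handler addresses is a
    // simplification that relies on each T having its own handle<T>.
    template <class T>
    const T* get() const {
        if (m_handler != &handle<T>) return nullptr;
        return static_cast<const T*>(m_handler(Action::Get, this, nullptr));
    }
};

// Usage sketch: store a string, copy the container, query both.
void checkTinyAny() {
    TinyAny a = TinyAny::make<std::string>("hello");
    TinyAny b(a);
    assert(*a.get<std::string>() == "hello");
    assert(*b.get<std::string>() == "hello");
    assert(a.get<int>() == nullptr);
}
```

The class itself is not a template; all knowledge about the stored type is captured in the one function pointer.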

Small objects optimization

Implementations are encouraged to avoid dynamic allocations for small objects. – cppreference

The data member __s_ is not a plain void* but a privately declared union:

using _Buffer = aligned_storage_t<3*sizeof(void*), alignment_of<void*>::value>;
union _Storage {
    constexpr _Storage() : __ptr(nullptr) {}
    void * __ptr;
    __any_imp::_Buffer __buf;
};

On a 64-bit machine, _Storage occupies 24 bytes. __buf shares its storage with __ptr and is used when the contained object is no larger than 24 bytes. Because such an object lives inside the std::any itself and needs no heap allocation, constructing or copying it can be cheaper. Larger objects have to be allocated dynamically on the heap at run time.

“24 bytes” is a deliberately chosen threshold: it is exactly the size of std::vector, std::string, and many other STL containers in libcxx on 64-bit platforms. std::any can therefore hold these common objects without an extra allocation, although the memory they manage internally may still be dynamic.

A similar, but subtler, memory optimization is applied to std::string. I will introduce it in a future article.
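The decision between the inline buffer and the heap can be expressed as a compile-time check. The sketch below is an assumption about what such a check looks like, based on my understanding that libcxx requires the type to fit into the buffer, to have compatible alignment, and to be nothrow move constructible; Buffer and fitsInline are hypothetical names.

```cpp
#include <array>
#include <cassert>
#include <type_traits>

// Stand-in for _Buffer: three pointers' worth of suitably aligned storage.
struct alignas(void*) Buffer {
    unsigned char bytes[3 * sizeof(void*)];
};

// Assumed criteria for inline storage: the object fits, its alignment is
// compatible with the buffer, and moving it cannot throw.
template <class T>
inline constexpr bool fitsInline =
    sizeof(T) <= sizeof(Buffer) &&
    alignof(Buffer) % alignof(T) == 0 &&
    std::is_nothrow_move_constructible<T>::value;

static_assert(fitsInline<int>, "an int is stored in the buffer");
static_assert(fitsInline<void*>, "a pointer is stored in the buffer");
static_assert(!fitsInline<std::array<char, 64>>, "too large, goes to the heap");

// Usage sketch: the same checks, evaluated at run time.
void checkFitsInline() {
    assert(fitsInline<double>);
    assert(!(fitsInline<std::array<char, 64>>));
}
```

Whether a given standard container passes this check depends on the standard library in use, since container sizes differ between implementations.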

In-place Construction

When a std::any object is copied, the object it manages is copied as well. This is straightforward yet important: a plain memcpy is not enough, because some classes, such as std::shared_ptr, have essential work to do when copied. std::any copies the object with the object's own copy constructor, through an allocator. The same applies to move construction.

But what about the constructor itself? Like emplace_back for std::vector, std::any also has an emplace-like constructor, in which the object is constructed directly in __buf instead of constructing a temporary object and then moving it.
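A small experiment shows that copying a std::any really runs the contained object's copy constructor. The function name useCountAfterCopy is made up for this illustration; it watches the reference count of a std::shared_ptr, which only changes when the pointer is properly copied.

```cpp
#include <any>
#include <cassert>
#include <memory>

// Returns the reference count observed while the original pointer and two
// std::any objects holding copies of it are all alive.
long useCountAfterCopy() {
    auto p = std::make_shared<int>(42);
    std::any a = p;  // copies p into the any: two owners
    std::any b = a;  // copy-constructs the contained shared_ptr: three owners
    return p.use_count();
}

// Usage sketch.
void checkCopy() {
    assert(useCountAfterCopy() == 3);
}
```

A bitwise copy would have left the count at 2 and caused a double release later, which is why the copy has to go through the type-specific handler.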

std::any a(std::in_place_type<std::string>, "hello");
std::any b("hello");

Inspecting these two variables in a debugger shows that std::in_place_type<std::string> has a twofold meaning: it tells std::any to construct a std::string instead of a const char*, and to construct it in place.
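The difference between the two constructions can also be checked in code instead of a debugger. The helper names below are hypothetical, and the sketch assumes RTTI is enabled, because it relies on std::any::type().

```cpp
#include <any>
#include <cassert>
#include <string>
#include <typeinfo>

bool holdsString(const std::any& value) {
    return value.type() == typeid(std::string);
}

bool holdsCharPointer(const std::any& value) {
    return value.type() == typeid(const char*);
}

// Usage sketch: the same literal ends up as two different contained types.
void checkInPlace() {
    std::any a(std::in_place_type<std::string>, "hello");
    std::any b("hello");  // the array decays, so a const char* is stored
    assert(holdsString(a));
    assert(holdsCharPointer(b));
    assert(!holdsString(b));
}
```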

Type to int mapping

std::any is not a class template, yet it can throw an exception when cast to the wrong static type, even if RTTI (run-time type information) is disabled. The secret behind this type safety is a mapping from each type to a unique static int, whose address serves as the type id.

On line 162 of any.h, a type-unique template struct is defined as:

template <class _Tp>
struct __unique_typeinfo {
    static constexpr int __id = 0;
};

// get type id if rtti is disabled
template <class _Tp>
inline constexpr const void* __get_fallback_typeid() {
    return &__unique_typeinfo<remove_cv_t<remove_reference_t<_Tp>>>::__id;
}

The static member __id of __unique_typeinfo always has the value 0, but each instantiation of the template has its own instance of it, so there is exactly one __id per type _Tp. Based on this, std::any uses the address of __id as a fallback type id when RTTI is disabled (compiling with the flag -fno-rtti).
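The address-as-id idea works outside the standard library too. The following sketch re-creates it with non-reserved, hypothetical names (UniqueTypeInfo, fallbackTypeId, sameType) and assumes C++17, where a static constexpr data member is implicitly an inline variable with a single address across the program.

```cpp
#include <cassert>
#include <type_traits>

template <class T>
struct UniqueTypeInfo {
    static constexpr int id = 0;  // the value is irrelevant, only the address matters
};

// Strips references and cv-qualifiers first, so that int, const int and
// const int& all map to the same id.
template <class T>
constexpr const void* fallbackTypeId() {
    return &UniqueTypeInfo<std::remove_cv_t<std::remove_reference_t<T>>>::id;
}

template <class A, class B>
bool sameType() {
    return fallbackTypeId<A>() == fallbackTypeId<B>();
}

// Usage sketch.
void checkTypeIds() {
    assert((sameType<int, const int&>()));
    assert((!sameType<int, long>()));
}
```

No typeid or type_info is involved, so the comparison keeps working when the code is compiled with -fno-rtti.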

Best Practices

To sum up, best practices for std::any include:

  • To avoid unnecessary copying, construct with std::make_any or std::in_place_type.
  • Pass std::any by reference if possible.
  • Use the pointer overload std::any_cast<T>(&a) to avoid copying large objects.
  • Make custom types follow the rule of three/five/zero if they are managed by std::any.
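The first three recommendations can be illustrated with a few lines of code. The function names are made up for this sketch; only std::make_any, std::any_cast, and std::bad_any_cast come from the standard library.

```cpp
#include <any>
#include <cassert>
#include <cstddef>
#include <string>

// Takes the any by const reference and uses the pointer overload of
// any_cast, which neither copies the value nor throws: it returns nullptr
// on a type mismatch.
std::size_t lengthIfString(const std::any& value) {
    if (const auto* s = std::any_cast<std::string>(&value)) return s->size();
    return 0;
}

// make_any constructs the std::string directly inside the any.
std::any makeGreeting() {
    return std::make_any<std::string>("hello");
}

// The value overload of any_cast copies the object and throws
// std::bad_any_cast when the requested type does not match.
bool throwsOnIntCast(const std::any& value) {
    try {
        (void)std::any_cast<int>(value);
        return false;
    } catch (const std::bad_any_cast&) {
        return true;
    }
}

// Usage sketch.
void checkBestPractices() {
    std::any greeting = makeGreeting();
    assert(lengthIfString(greeting) == 5);
    assert(lengthIfString(std::any(42)) == 0);
    assert(throwsOnIntCast(greeting));
}
```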

Formatting of std::any in LLDB

Although both belong to the LLVM project, LLDB does not provide a formatted display for libcxx's std::any (whereas GDB does for libstdc++'s). Printing a std::any in the LLDB CLI gives:

(lldb) n
Process 76235 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = step over
frame #0: 0x0000000100003a3c Play`main at main.cpp:17:5
14 int main() {
15 std::any a = 1;
-> 16 return 0;
17 }
(lldb) fr v a
(std::any) $0 = {
__h = 0x0000000100003d2c (Play`std::__1::__any_imp::_SmallHandler<int>::__handle(std::__1::__any_imp::_Action, std::__1::any const*, std::__1::any*, std::type_info const*, void const*) at any:350)
__s = {
__ptr = 0x0000000000000001
__buf = (__lx = "\U00000001\0\0\0\0\0\0\0\U00000001\0\xc1\x89FͽV\0:\0\0\U00000001")
}
}

It takes a while to work out that “a” is an int (from the type argument of _SmallHandler) and that its value is 0x1 (from the first 4 bytes of __buf). I wrote a Python script, as a plugin based on the LLDB API, to print it more intuitively.

Implementation of plugin

There are two functions in the script: __lldb_init_module is the entry point of the plugin, and handle_std_any is the handler that LLDB calls.

The tricky part is capturing the type name inside _SmallHandler with a regex. With that name, we can look up the object representing the type in Python and then cast the buffer pointer to it. The script should work for integers, floats, and std::string, but it is not very robust yet. I will keep polishing it.

The same type-capturing trick can also be used in C++ source code. __PRETTY_FUNCTION__ (a GCC/Clang extension) carries the type names of a function template's arguments, so you can do some static reflection with it. A famous example is Magic Enum.

Another thing to note is that public classes of libcxx need special treatment because they live in the inline namespace __1.
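The __PRETTY_FUNCTION__ trick mentioned above can be sketched in a few lines. This is only an illustration for GCC and Clang, with a made-up function name (typeName) and deliberately naive parsing: it looks for "T = " in the signature string and stops at the first ';' or ']', so it will not handle array types or other names that contain those characters.

```cpp
#include <cassert>
#include <string_view>

// GCC/Clang only: __PRETTY_FUNCTION__ expands to the full signature of the
// enclosing function, including the template argument, for example
// "... typeName() [with T = int; ...]" on GCC or "... typeName() [T = int]"
// on Clang.
template <class T>
std::string_view typeName() {
    const std::string_view signature = __PRETTY_FUNCTION__;
    const std::string_view marker = "T = ";
    const auto pos = signature.find(marker);
    if (pos == std::string_view::npos) return {};
    const auto begin = pos + marker.size();
    // GCC terminates the type with ';', Clang with ']'.
    const auto end = signature.find_first_of(";]", begin);
    return signature.substr(begin, end - begin);
}

// Usage sketch.
void checkTypeName() {
    assert(typeName<int>() == "int");
    assert(typeName<double>() == "double");
}
```

The returned view points into a string literal with static storage duration, so it stays valid after the function returns.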