String literals: C/C++ string data types part 2
In the first part of this series on string types in LibreOffice, I briefly discussed some of the character and string data types that are in use in various places of the LibreOffice code: OString, OUString, char/char*, sal_Unicode, sal_Unicode*, rtl_String, rtl_uString and also std::string. Now I want to explain string literals.
String Literals
In C/C++, a string literal is a sequence of characters enclosed in double quotation marks, and it represents read-only textual data. For example:
const char *str = "abc";
Please note that it is different from a character literal, which is a single character in single quotation marks:
const char c = 'a';
The writable versions of these data types, char* and char, simply omit the const qualifier. Note that the characters of a string literal itself must never be modified.
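The difference between the two kinds of literal can be checked with sizeof. The following is a minimal sketch in standard C++; the names kLiteralBytes, kCharBytes and visibleLength are made up for this illustration.

```cpp
#include <cstddef>
#include <cstring>

// A string literal is an array of const char that ends with a hidden '\0'.
constexpr std::size_t kLiteralBytes = sizeof("abc");  // 'a', 'b', 'c', '\0'

// A character literal is a single char (in C++; in C it has type int).
constexpr std::size_t kCharBytes = sizeof('a');

// strlen() counts the characters before the terminator, not the terminator.
inline std::size_t visibleLength(const char* text)
{
    return std::strlen(text);
}

static_assert(kLiteralBytes == 4, "three characters plus the terminator");
static_assert(kCharBytes == 1, "a char literal occupies one byte in C++");
```

So "abc" occupies four bytes even though visibleLength("abc") reports three characters.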
The char* data type is widely used in the C programming language, but it is not the data type of choice in LibreOffice. As described in my previous post, OString is used for 8-bit text, and OUString is used for Unicode text in LibreOffice. It is worth noting that it is possible to store UTF-8 encoded Unicode text in OString.
In the past, it was possible to convert the const char* literal to OString/OUString like this (it will not compile now):
OString sText = "abc";
OUString sUniText = u"abc";
This was not an efficient way to define and use such strings. The plain string literal is stored in read-only memory. But then a new chunk of dynamic memory is allocated on the heap for the new O[U]String object, and the constructor copies the contents of the read-only memory into it. Also, the new OUString needs reference counting. These are unnecessary, expensive operations, and we should avoid them.
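The copy described above can be observed with the standard library. This sketch uses std::string as a stand-in for O[U]String, which is an assumption made only for illustration: std::string is not reference counted, but it owns its characters in the same way. std::string_view plays the role of a type that merely points at the literal. The function names are hypothetical.

```cpp
#include <string>
#include <string_view>

// The literal itself lives in static storage for the whole program run.
static const char kLiteral[]
    = "a string literal long enough that it will not fit in a small buffer";

// An owning string has its own storage, so constructing it copies the data.
inline bool stringCopiesLiteral()
{
    std::string copy = kLiteral;
    return copy.data() != kLiteral;
}

// A view only records a pointer and a length, so nothing is copied.
inline bool viewSharesLiteral()
{
    std::string_view view = kLiteral;
    return view.data() == kLiteral;
}
```

Both functions return true: the owning string ends up with a second copy of the characters, while the view keeps pointing at the original literal.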
O[U]StringLiteral
In LibreOffice, OStringLiteral and OUStringLiteral are the data types used to represent string literals for ASCII and Unicode data, respectively.
As an example, you can see lines like this in LibreOffice .cxx files:
static constexpr OUStringLiteral sStart = u"ABC";
static constexpr OStringLiteral sEnd("DEF");
The constexpr keyword ensures that the expression is evaluated at compile time, which can improve the performance of the program. Also, avoiding the reference counting of O[U]String helps to make the operation cheaper.
Later, OString/OUString variables are constructed from the O[U]StringLiteral objects, or the literals are passed to functions that expect OString/OUString parameters. The difference is that when static constexpr literals are used, the data is not stored in dynamic memory: it is allocated once and it is read-only, which improves performance. This approach is only usable for strings that are initialized once and never modified later.
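To give a feeling for how a literal type can be built entirely at compile time, here is a simplified sketch. DemoLiteral is a hypothetical class written for this post. It is not the real OUStringLiteral, whose implementation differs, but it shows the idea of a constexpr object that needs no heap memory and no reference counting.

```cpp
#include <cstddef>

// Hypothetical stand-in for illustration only, NOT LibreOffice's class.
template <std::size_t N>
class DemoLiteral
{
public:
    // Copies the characters of the literal (terminator included) into the
    // object. Because the constructor is constexpr, this can happen at
    // compile time.
    constexpr DemoLiteral(const char16_t (&text)[N])
        : m_buffer{}
    {
        for (std::size_t i = 0; i < N; ++i)
            m_buffer[i] = text[i];
    }

    // Number of characters, without the terminating null.
    constexpr std::size_t length() const { return N - 1; }

    constexpr char16_t operator[](std::size_t i) const { return m_buffer[i]; }

private:
    char16_t m_buffer[N];
};

// N is deduced from the literal (C++17 class template argument deduction).
static constexpr DemoLiteral sStart(u"ABC");

// These checks are evaluated by the compiler, not at run time.
static_assert(sStart.length() == 3);
static_assert(sStart[0] == u'A');
```

Since the static_assert lines compile, the whole object, including its length, is known before the program starts.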
String Literals in Headers
If you are working with a .hxx C++ header file, you have to use the inline keyword to avoid creating duplicate copies of the global variable. For example:
inline constexpr OUStringLiteral ABC(u"abc");
Later we will see that we can re-write the above with a suffix as:
inline constexpr OUString ABC = u"abc"_ustr;
Essentially, that is a better replacement for the macro:
#define ABC "abc"
or, sometimes:
const char ABC[] = "abc";
These are no longer desirable, now that proper string literal types are available with the latest C++ standard and new LibreOffice code. Also, it is important to know that the goal is to eventually get rid of the O[U]StringLiteral data types in favor of the simpler form with suffixes.
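One reason a typed constant is preferable to a macro is that it obeys normal C++ rules for scope and type. The following sketch uses std::string_view from the standard library as a stand-in for OUString, and the namespace demo is hypothetical.

```cpp
#include <string_view>

namespace demo
{
// One definition shared by every file that includes the header (C++17).
// Unlike a macro, the name is confined to its namespace and has a real type.
inline constexpr std::string_view ABC = "abc";
}

// The compiler knows the length, so it can be checked at compile time.
static_assert(demo::ABC.size() == 3);
static_assert(demo::ABC == "abc");
```

A macro named ABC would instead be replaced textually everywhere after its definition, including inside unrelated namespaces and classes.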
Prefixes
String literals with no prefix are single-byte strings that consist of 8-bit characters. Unicode and wide string literals have various prefixes that indicate their types. For example, to represent ABC in ASCII, UTF-8, UTF-16, UTF-32 and wide-char, you need to write:
// requires C++20
char ascii_cstr[] = "ABC";
char8_t utf8_cstr[] = u8"ABC";
char16_t utf16_cstr[] = u"ABC";
char32_t utf32_cstr[] = U"ABC";
wchar_t w_cstr[] = L"ABC";
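The prefix decides how many code units a character needs. The sketch below counts the array elements of a few literals. The helper elementCount is made up for this example, and it avoids naming char8_t so that it also compiles as C++17.

```cpp
#include <cstddef>

// Number of array elements in a literal, terminator included.
template <typename T, std::size_t N>
constexpr std::size_t elementCount(const T (&)[N])
{
    return N;
}

// "ABC" is three code units plus the terminator in every encoding.
static_assert(elementCount("ABC") == 4);
static_assert(elementCount(u8"ABC") == 4);
static_assert(elementCount(u"ABC") == 4);
static_assert(elementCount(U"ABC") == 4);
static_assert(elementCount(L"ABC") == 4);

// The code units differ in size, so the arrays differ in bytes.
static_assert(sizeof(u"ABC") == 4 * sizeof(char16_t));
static_assert(sizeof(U"ABC") == 4 * sizeof(char32_t));

// U+00E9 (e with acute accent): two UTF-8 units, one UTF-16 unit.
static_assert(elementCount(u8"\u00E9") == 3);
static_assert(elementCount(u"\u00E9") == 2);

// U+1F600 (an emoji): a surrogate pair in UTF-16, one unit in UTF-32.
static_assert(elementCount(u"\U0001F600") == 3);
static_assert(elementCount(U"\U0001F600") == 2);
```

For plain ASCII text the element counts match, and they only diverge once characters outside ASCII are involved.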
Suffixes
Now that C++20 has become the baseline for LibreOffice source code, and thanks to Stephan Bergmann, it has become possible to simplify the code and avoid the O[U]StringLiteral data types by writing it in a much shorter form, like:
static constexpr OUString sStr = u"abc"_ustr;
static constexpr OString sTransSource("def"_ostr);
As you can see in the above code snippet, _ustr is used for Unicode strings, and _ostr for non-Unicode strings.
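Suffixes like these are user-defined literals, a C++ language feature. The sketch below defines two hypothetical suffixes, _demo16 and _demo8, that return standard string views. It only illustrates the general mechanism: the real _ustr and _ostr are implemented differently, building on the C++20 feature mentioned at the end of this post.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical suffix for UTF-16 literals. The compiler passes a pointer to
// the characters and their count (terminator excluded).
constexpr std::u16string_view operator""_demo16(const char16_t* text,
                                                std::size_t length)
{
    return std::u16string_view(text, length);
}

// Hypothetical suffix for 8-bit literals.
constexpr std::string_view operator""_demo8(const char* text,
                                            std::size_t length)
{
    return std::string_view(text, length);
}

// Same shape as the LibreOffice snippet, using the stand-in suffixes.
static constexpr std::u16string_view sStr = u"abc"_demo16;
static constexpr std::string_view sTransSource("def"_demo8);

static_assert(sStr.size() == 3);
static_assert(sTransSource == "def");
```

The prefix of the literal (u or none) selects which overload is called, which is why a Unicode suffix and an 8-bit suffix can coexist.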
Since the C++14 standard, you can use the s suffix to get a standard C++ string out of a string literal, but you need to explicitly state that you are using the std::string_literals namespace first.
using namespace std::string_literals;
std::string ascii_str = "ABC"s;
std::u8string utf8_str = u8"ABC"s;
std::u16string utf16_str = u"ABC"s;
std::u32string utf32_str = U"ABC"s;
std::wstring wstring_str = L"ABC"s;
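One practical difference between the s suffix and a plain conversion shows up with embedded null characters. This is a small sketch in standard C++14, and the two function names are made up for the example.

```cpp
#include <cstddef>
#include <string>

using namespace std::string_literals;

// Constructing from a const char* stops at the first '\0'.
inline std::size_t lengthViaPointer()
{
    return std::string("ab\0cd").size();
}

// The s suffix receives the full length of the literal from the compiler.
inline std::size_t lengthViaSuffix()
{
    return "ab\0cd"s.size();
}
```

Here lengthViaPointer() returns 2, while lengthViaSuffix() returns 5, because the suffix keeps the embedded null and the characters after it.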
Final Words
Don’t be afraid of the various string types that we discussed here! Most of the time, you will be using OUString. The other types will come up occasionally when you work with different parts of the huge LibreOffice source code.
There are still other data types related to working with strings, such as streams, buffers and string view types, that I will discuss in the next part of this series of blog posts.
If you want to know more, refer to the presentation by Stephan Bergmann at the LibreOffice Conference 2023. He talks about the improvements in C++20 (class non-type template parameters) that made it possible to simplify the string literals in LibreOffice code.