String types in LibreOffice C/C++ code – part 1

Strings are a very important data type used in LibreOffice. They store textual data, and a string is essentially a sequence of characters. As LibreOffice has many modules that depend on various libraries and languages, there are different string types in LibreOffice. Here, we discuss some of them.

Character and String Data Types in C++

In C++, the standard std::string is available alongside the internal LibreOffice data types.

std::string std_str = "ت";
std::cout << "std::string: " << std_str << std::endl;

The standard std::string is not the data type of choice for storing textual values and passing them between classes and methods, because LibreOffice has its own set of data types for this purpose. See the next sections for more information.

Please note that the usual functions for working with C and C++ strings may give unexpected results when the programmer does not account for multi-byte encodings (like UTF-8). For example, for a UTF-8 string, std::string::length() returns the correct count of bytes (code units), but not the count of Unicode code points or “characters”.
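To make the difference between bytes and code points concrete, here is a minimal sketch that uses only the standard library. The helper name countCodePoints is hypothetical (it is not a LibreOffice or standard function), and it assumes the input is valid UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: count Unicode code points in a UTF-8 string.
// Continuation bytes always have the bit pattern 10xxxxxx, so every byte
// that does NOT match this pattern starts a new code point.
// Assumes valid UTF-8; no validation is performed.
std::size_t countCodePoints(const std::string& utf8)
{
    std::size_t count = 0;
    for (unsigned char byte : utf8)
    {
        if ((byte & 0xC0) != 0x80)
            ++count;
    }
    return count;
}
```

For the letter "ت" (U+062A, stored as the two bytes 0xD8 0xAA), std::string::length() gives 2, while countCodePoints() gives 1. Note that a code point is still not always what a user perceives as one character, because combining marks are separate code points.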

Characters and Strings in LibreOffice C++ Source Code

In addition to the standard C++ string type above, LibreOffice has its own string classes. OString is the 8-bit string class, and it does not keep information about its encoding. OUString, on the other hand, uses UTF-16 encoding and is more widely used.
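Because UTF-16 is a variable-length encoding, a length measured in 16-bit code units is not necessarily the number of code points. The following sketch illustrates this with std::u16string as a stand-in, so that it compiles without the LibreOffice headers. It is not the OUString API, and the helper name countCodePointsUtf16 is hypothetical.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: count code points in a UTF-16 string.
// A code point outside the Basic Multilingual Plane is stored as a
// surrogate pair (two code units), so low surrogates (0xDC00..0xDFFF)
// are skipped. Assumes well-formed UTF-16.
std::size_t countCodePointsUtf16(const std::u16string& utf16)
{
    std::size_t count = 0;
    for (char16_t unit : utf16)
    {
        if (unit < 0xDC00 || unit > 0xDFFF)
            ++count;
    }
    return count;
}
```

For example, the string u"\u0641\U0001F600" holds two code points but three UTF-16 code units, because U+1F600 needs a surrogate pair.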

Code sample

Here is a sample code snippet for working with these LibreOffice string classes in C++:

// The text is: "واحِدْ إثٍنين ثلاثةٌ" which means "one two three"
OUString aOneTwoThree(reinterpret_cast<sal_Unicode const *>
   (u"\u0648\u0627\u062d\u0650\u062f\u0652 \u0625\u062b\u064d\u0646\u064a\u0646"
    " \u062b\u0644\u0627\u062b\u0629\u064c" ));
std::cout << "" << aOneTwoThree << std::endl;

OUString ouStr = (sal_Unicode*)u"فارسی";
std::cout << ouStr.getLength() << std::endl;

OUString sTestString = (sal_Unicode*)u"The quick brown fox\n jumped over the lazy dog العاشر";
std::cout << sTestString << std::endl;

OString oStr2("پ");
std::cout << "Unicode OString: " << oStr2 << std::endl;

Character and String Data Types in C

Some small (but important) parts of LibreOffice are written in the C programming language. There, the main string type is char[] (which is closely related to char *, with slight differences: for example, an array converts to a pointer to its first element when it is passed to a function). Essentially, it is an array of 8-bit (1-byte) characters that ends with the null byte: '\0', the character with code zero. The char data type itself is used to store individual 8-bit characters. It is also possible to store UTF-8 encoded Unicode strings in C strings.
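The slight difference between char[] and char * can be sketched as follows. The example is written in C++ to stay consistent with the other snippets, and the helper names arraySize and byteLength are hypothetical.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helper: size of a char array in bytes, including the
// terminating '\0'. The template only accepts real arrays, not pointers,
// because the size information is lost once an array becomes a pointer.
template <std::size_t N>
constexpr std::size_t arraySize(const char (&)[N])
{
    return N;
}

// Hypothetical helper: number of bytes before the terminating '\0'.
// This works on a pointer, because it scans the memory for the null byte.
std::size_t byteLength(const char* str)
{
    return std::strlen(str);
}
```

For char s[] = "abc", arraySize(s) is 4 and byteLength(s) is 3. For the UTF-8 encoded word "عربي" (four letters, two bytes each), byteLength() reports 8, which again is a count of bytes and not of characters.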

Code sample

Here is a sample code snippet for working with these data types in C:

// storing ASCII
char c='a';
printf("Non-Unicode C Character: %c\n", c);

char s[] = "عربي";
printf("Unicode UTF-8 String: %s\n", s);

setlocale(LC_ALL, "");
wchar_t w_char = L'ب';
printf("Unicode Wide Character: %lc\n", w_char);

Characters and Strings in LibreOffice C Source Code

The underlying Unicode character data type in LibreOffice is sal_Unicode, and the underlying string types are rtl_String and rtl_uString. They are suitable for C source code.

Code sample

Here is a sample code snippet for working with these data types in C:

sal_Unicode ouChar = u'ب';
printf("Unicode Character: %lc\n", ouChar);

rtl_String *rtl_str = nullptr;
rtl_string_newFromStr(&rtl_str, "پ");
printf("rtl_String: %s\n", rtl_str->buffer);
rtl_string_release(rtl_str);

rtl_uString *rtl_ustr = nullptr;
rtl_uString_newFromStr(&rtl_ustr, (sal_Unicode*)u"الف");
printf("%s", "rtl_uString: ");

Characters and Strings in Windows

To handle Unicode characters on Windows, wide characters are used. The wide character type is wchar_t, and wchar_t[] strings are based on it. The C++ counterpart of these string types is std::wstring.
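One thing to keep in mind is that the width of wchar_t is platform-dependent: it is 16 bits on Windows (UTF-16), and typically 32 bits on Linux. The following sketch, with the hypothetical helper names unitsForNonBmpCodePoint and wideUnits, shows how this affects the length of a std::wstring.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: number of wchar_t units needed for one code point
// outside the Basic Multilingual Plane. This is 2 where wchar_t is 16 bits
// wide (UTF-16, as on Windows), and 1 where it is 32 bits wide (as on
// typical Linux systems).
constexpr std::size_t unitsForNonBmpCodePoint()
{
    return sizeof(wchar_t) == 2 ? 2 : 1;
}

// Hypothetical helper: length of a wide string in wchar_t units,
// which is not necessarily its length in code points.
std::size_t wideUnits(const std::wstring& text)
{
    return text.length();
}
```

The word "الف" (L"\u0627\u0644\u0641") has a length of 3 on both kinds of platform, but a single code point such as U+1F600 occupies two wchar_t units on Windows and one on a typical Linux system.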

This code is Windows-specific:

//Windows specific
#ifdef _WIN32
std::wstring w_string = L"الف";
wprintf(L"std::wstring: %ls\n", w_string.c_str());
printf("%ls\n", rtl_ustr->buffer);
#else
for(int i=0; i<rtl_ustr_getLength(rtl_ustr->buffer); ++i)
  printf("%lc", rtl_ustr->buffer[i]);
#endif
rtl_uString_release(rtl_ustr);

Please note that this code snippet is the continuation of the above code.

Characters and Strings in Qt

As LibreOffice provides a Qt UI, there is a need to work with Qt data types. Specifically, QString is the string data type provided by the Qt library. The QString class provides a rich set of functions for storing and manipulating textual data in C++ applications that use the Qt library.

For more information, refer to the QString page in the Qt 6 documentation:

https://doc.qt.io/qt-6/qstring.html

Code Sample

QString q_str = "ABC ا ب پ ت";
qDebug() << "QString: " << q_str;
qDebug() << q_str.length();

Additionally, LibreOffice provides a GTK UI, so there is also a need to work with GTK data types in the relevant source files. Specifically, the character data type used there is gchar, and the string data type is gchar *.

Also, GString (from GLib) is a struct suitable for storing and manipulating textual data. You can see its structure and utility functions in the GLib manual:

https://docs.gtk.org/glib/struct.String.html

Code Sample

Here is a sample program, gchar.c:

#include <glib.h>
#include <string.h>

int main(int argc, char *argv[])
{
    gchar *string1 = "Test";
    g_print("%s\n", string1);

    GString *string2 = g_string_new ("Hello");
    g_print("%lu\n", string2->len);
    g_print("%lu\n", strlen (string2->str));
    g_string_free(string2, TRUE);

    return 0;
}

You can compile it with:

gcc gchar.c -o gchar `pkg-config --cflags --libs glib-2.0`

Refactoring String Types

Not all of the possible string data types are desirable. Here is an example of the refactoring that has been done:

// This can be replaced with std::string
// https://gerrit.libreoffice.org/c/core/+/112980
std::unique_ptr<char[]> m_pFileName;

It is now converted to std::string:

std::string m_sFileName;

There are situations where you have to pass a C string buffer to a C function so that the function can fill it with textual data. In such cases, where the data needs to be modified, you can use std::vector<char> instead. For example:

std::vector<char> vectorChar(11); // 10 characters plus the terminating '\0'
strncpy(vectorChar.data(), "ABCDEFGHIJ", 11);
printf("%s\n", vectorChar.data());
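The same idea applies to C functions that write into a buffer supplied by the caller. Here is a minimal sketch that uses the standard snprintf() function as the C function. The wrapper name formatNumber is hypothetical and only serves as an illustration.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical wrapper: format an integer through the C function
// snprintf(), using a std::vector<char> as the writable buffer instead
// of a raw char array.
std::string formatNumber(int value)
{
    // A first call with a null buffer only reports how many characters
    // are needed (excluding the terminating '\0').
    const int needed = std::snprintf(nullptr, 0, "%d", value);
    if (needed < 0)
        return std::string();

    // Reserve room for the characters plus the terminating '\0'.
    std::vector<char> buffer(static_cast<std::size_t>(needed) + 1);
    std::snprintf(buffer.data(), buffer.size(), "%d", value);

    // Copy the characters, leaving out the terminating '\0'.
    return std::string(buffer.data(), static_cast<std::size_t>(needed));
}
```

The vector owns the memory and releases it automatically, and its size is always known, which makes it harder to write past the end of the buffer by accident.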

String Literals, Streams, Buffers and String View Types

These are the classes for the literals used in LibreOffice:

OStringLiteral
OUStringLiteral
QStringLiteral

These are the stream and buffer classes that are useful for creating temporary objects for string manipulation:

std::stringstream
OStringBuffer
OUStringBuffer
QStringBuilder

Finally, these are some of the string view types used in LibreOffice:

std::string_view
std::u16string_view

We will discuss these types in the next blog posts.