Recursive Descriptions of Polar Codes
Noam Presman and Simon Litsyn*

Abstract
Polar codes are recursive general concatenated codes. This property motivates a recursive formalization of the known decoding algorithms: Successive Cancellation, Successive Cancellation with Lists and Belief Propagation. Such a description allows easy development of these algorithms for arbitrary polarizing kernels. Hardware architectures for these decoding algorithms are also described in a recursive way, both for Arikan’s standard polar codes and for arbitrary polarizing kernels.

1 Introduction
Polar codes were introduced by Arikan [1] and provided a scheme for achieving the symmetric capacity of binary memoryless channels (B-MC) with polynomial encoding and decoding complexities. Arikan used the so-called \( (u + v, v) \) construction, which is based on the following linear kernel

\[
G_2 = \begin{bmatrix}
1 & 0 \\
1 & 1
\end{bmatrix}.
\]

In this scheme, a \( 2^n \times 2^n \) matrix, \( G_2^{\otimes n} \), is generated by performing the Kronecker power on \( G_2 \). An input vector \( u \) of length \( N = 2^n \) is transformed into an \( N \) length vector \( x \) by multiplying a certain permutation of the vector \( u \) by \( G_2^{\otimes n} \). The vector \( x \) is transmitted over \( N \) independent copies of the memoryless channel, \( W \). This results in \( N \) new (dependent) channels between the individual components of \( u \) and the outputs of the channels. Arikan showed that these channels exhibit the phenomenon of polarization under Successive Cancellation (SC) decoding. This means that as \( n \) grows, a proportion \( I(W) \) (the symmetric channel capacity) of the channels become clean channels (i.e. having capacity approaching 1) and the rest of the channels become completely noisy (i.e. with capacity approaching 0). Arikan showed that the SC decoding algorithm has time and space complexity of \( O(N \cdot \log(N)) \) (the same asymptotic complexities apply also to the encoding algorithm). Furthermore, it was shown [2] that asymptotically in the block length \( N \), the block error probability of this scheme decays to zero like \( O(2^{-\sqrt{N}}) \).
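
As an illustration of the polarization phenomenon, the following Python sketch tracks erasure probabilities for the special case in which \( W \) is a binary erasure channel. It relies on the well-known fact that one application of the \( (u + v, v) \) kernel turns two copies of a BEC with erasure probability \( \epsilon \) into a worse channel with erasure probability \( 2\epsilon - \epsilon^2 \) and a better one with \( \epsilon^2 \). The function names `bec_polarize` and `fraction_good` and the threshold `delta` are illustrative choices, not notation from this paper.

```python
def bec_polarize(eps, n):
    """Erasure probabilities of the 2**n synthesized channels of a BEC(eps)."""
    probs = [eps]
    for _ in range(n):
        nxt = []
        for z in probs:
            nxt.append(2 * z - z * z)  # degraded channel (seen by the first input)
            nxt.append(z * z)          # upgraded channel (seen by the second input)
        probs = nxt
    return probs


def fraction_good(eps, n, delta=1e-3):
    """Fraction of synthesized channels whose erasure probability is below delta."""
    probs = bec_polarize(eps, n)
    return sum(1 for z in probs if z < delta) / len(probs)


for n in (4, 8, 12):
    print(n, fraction_good(0.5, n))
```

As \( n \) grows, the printed fraction is expected to move toward \( I(W) = 1 - \epsilon \), while the average erasure probability over all synthesized channels stays equal to \( \epsilon \).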

Generalizations of Arikan’s code structures were soon to follow. Korada et al. considered binary and linear kernels [3]. They showed that a binary linear kernel is polarizing if and only if there does not exist a column permutation of its generating matrix which is upper-triangular, and analyzed its rate of polarization, by introducing the notion of the kernel exponent. Mori and Tanaka considered the general case of a mapping \( g(\cdot) \), which is not necessarily linear and binary, as a basis for channel polarization constructions [4]. They gave sufficient conditions for polarization and generalized the exponent for these cases. They further showed examples of linear and non-binary Reed-Solomon codes and Algebraic Geometry codes with exponents that are far better than the exponents of the known binary kernels [5]. The authors of this correspondence gave examples of binary but non-linear kernels having the optimal exponent per their kernel dimensions [6, 7].

*Noam Presman and Simon Litsyn are with the School of Electrical Engineering, Tel Aviv University, Ramat Aviv 69978 Israel. (e-mails: {presmann, litsyn}@eng.tau.ac.il).

All of the aforementioned polar code structures have homogenous kernels, meaning that the alphabets of their inputs and their outputs are the same. The authors of this correspondence considered the case in which some of the inputs of a kernel may have a different alphabet than the rest of the inputs. This results in the so-called mixed-kernels structure, which has demonstrated good performance for finite length codes in many cases. A further generalization of the polar code structure was suggested by Trifonov [9], in which the outer polar codes were replaced by suitable codes along with their appropriate decoding algorithms. We note here that the representation of polar codes as instances of general concatenated codes (GCC) is fundamental to this correspondence, and we elaborate on it in the sequel.

Generalizations and alternatives to SC as the decoding algorithm were also extensively studied. Tal and Vardy introduced the Successive Cancellation List (SCL) decoder [10, 11]. In this algorithm, the decoder considers up to \( L \) concurrent decoding alternatives on each one of its stages, where \( L \) is the size of the list. At the final stage of the algorithm, the most likely result is selected from the list. The asymptotic time and space complexities of this decoder are the same as those of the standard SC algorithm, multiplied by \( L \). Furthermore, incorporation of a cyclic redundancy check code (CRC) as an outer-code, results in a scheme with an excellent error-correcting performance, which in many cases is comparable with state of the art schemes (see e.g. [11, Section V]).

Belief-Propagation is an alternative to the SC decoding algorithm. This is a message passing iterative decoding algorithm that operates on the normal factor graph representation of the code. It is known to outperform SC over the Binary Erasure Channel (BEC) [12] and seems to have good performance on other channels as well [12, 13].

Leroux et al. considered efficient hardware implementations for the SC decoder of the \((u + v, v)\) polar code [14, 15]. They gave an explicit design of a "line decoder" with \( N/2 \) processing elements and \( O(N) \) memory elements. Their work contains an efficient approximate min-sum decoder and a discussion of a fixed point implementation. Their design is verified by an ASIC synthesis. Efficient limited parallelism decoders were considered by Leroux et al. [14] and by Pamuk and Arikan [17]. Hardware implementation of the SCL decoder was discussed in the papers of Balatsoukas-Stimming et al. [18, 19]. Pamuk considered a hardware design of a BP decoder tailored for an FPGA implementation [20].

The goal of this paper is to emphasize the formalization of polar codes as recursive GCCs and the implication of this property on the encoding and decoding algorithms. The main contributions of this manuscript are as follows: 1) Formalizing Tal and Vardy’s SCL as a recursive algorithm, and thereby generalizing it to arbitrary kernels. 2) Formalizing the SC line decoder of Leroux et al. and generalizing it to arbitrary kernels. 3) Defining a BP decoder with GCC schedule, and suggesting a BP line architecture for it.

The paper is organized as follows. In Section 2 we describe polar code kernels as the generating building blocks of polar codes. We then elaborate on the fact that polar codes are examples of recursive GCC structures. This fundamental notion is the motivation for formalizing the encoding and decoding algorithms in a recursive fashion in Sections 3 and 4 respectively. In particular, we study the standard SC, the SCL (both for Arikan’s kernels and arbitrary ones) and BP (for linear lower triangular kernels) decoding algorithms. These formalizations lay the ground for schematic architectures of the decoding algorithms in Subsection 5.1. Specifically, we restate the SC pipeline and SC line decoders of Leroux et al., and introduce a line decoder for the GCC schedule of the BP algorithm. Finally, in Subsection 5.2 we consider generalizations of these architectures for arbitrary kernels.

2 Preliminaries

Throughout we use the following notations. For a natural number \( \ell \), we denote \([\ell]=\{1,2,\ldots,\ell\}\) and \([\ell]_-=\{0,1,2,\ldots,\ell-1\}\). We represent vectors by bold letters. For \( i \geq j \), let \( u_j^i = [u_j, u_{j+1}, \ldots, u_i] \) be the sub-vector of \( u \) of length \( i-j+1 \) (if \( i < j \) we say that \( u_j^i = [\ ] \), the empty vector, and its length is 0). For two vectors \( u \) and \( v \) of lengths \( n_u \) and \( n_v \), we denote the \( n_u + n_v \) length vector which is the concatenation of \( u \) to \( v \) by \([u, v]\) or \([u\ v]\) or \(u\ v\) or just \(uv\). For a scalar \( x \), the \( n_u + 1 \) length vector \( u\ x \) is just the concatenation of the vector \( u \) with the length one vector containing \( x \). Matrices are denoted by boldface capital letters. We denote the set of all the matrices of $n_1$ rows and $n_2$ columns over a field $F$ by $F^{n_1 \times n_2}$. Let $A \in F^{n_1 \times n_2}$. We denote row $i$ (column $j$) of the matrix by $A_{i \rightarrow}$ ($A_{\downarrow j}$). The element at row $i$ and column $j$ is denoted by $A_{i,j}$. The sub-matrix containing only rows $i_1 \leq i \leq i_2$ and columns $j_1 \leq j \leq j_2$ is denoted as $A_{i_1:i_2, j_1:j_2}$.

In this paper we consider kernels that are based on bijective transformations over a field $F$. A channel polarization kernel of $\ell$ dimensions, denoted by $g(\cdot)$, is a mapping

$$g : F^\ell \to F^\ell.$$ 

This means that $g(u) = x$, $u, x \in F^\ell$.

We refer to this type of kernel as a homogeneous kernel, because its $\ell$ input coordinates and $\ell$ output coordinates are from the same alphabet $F$. Symbols from an alphabet $F$ are called $F$-symbols in this paper. The homogenous kernel $g(\cdot)$ may generate a polar code of length $\ell^n$ $F$-symbols by inducing a larger mapping from it, in the following way [4].

**Definition 1 (Homogenous Polar Code Generation)** Given an $\ell$ dimensions transformation $g(\cdot)$, we construct a mapping $g^{(n)}(\cdot)$ of $N = \ell^n$ dimensions (i.e. $g^{(n)}(\cdot) : F^{\ell^n} \to F^{\ell^n}$) in the following recursive fashion.

$$g^{(1)}(u_0^{\ell-1}) = g(u_0^{\ell-1});$$

for $n > 1$,

$$g^{(n)}(u_0^{N-1}) = \left[ g(\gamma_{0,0}, \gamma_{1,0}, \ldots, \gamma_{\ell-1,0}), \ g(\gamma_{0,1}, \gamma_{1,1}, \ldots, \gamma_{\ell-1,1}), \ \ldots, \ g(\gamma_{0,N/\ell-1}, \gamma_{1,N/\ell-1}, \ldots, \gamma_{\ell-1,N/\ell-1}) \right],$$

where

$$\left[ \gamma_{i,j} \right]_{j=0}^{N/\ell-1} = g^{(n-1)}\left(u_{i \cdot (N/\ell)}^{(i+1) \cdot (N/\ell)-1}\right), \quad i \in [\ell]_-.$$

2.1 Polar Codes as Recursive General Concatenated Codes

General Concatenated Codes (GCC) are error correcting codes that are constructed by a technique introduced by Blokh and Zyablov and by Zinoviev. In this construction, we have $\ell$ outer-codes $\{C_i\}_{i=0}^{\ell-1}$, where $C_i$ is an $N_{out}$ length code of size $M_i$ over alphabet $F_i$. We also have an inner-code of length $N_{in}$ and size $\prod_{i=0}^{\ell-1} |F_i|$ over alphabet $F$, with a nested encoding function $\phi : F_0 \times F_1 \times \ldots \times F_{\ell-1} \to F^{N_{in}}$. The GCC that is generated by these components is a code of length $N_{out} \cdot N_{in}$ symbols and of size $\prod_{i=0}^{\ell-1} M_i$. It is created by taking an $\ell \times N_{out}$ matrix, in which the $i$th row is a codeword from $C_i$, and applying the inner mapping $\phi$ on each of the $N_{out}$ columns of the matrix. As Dumer describes in his survey, GCCs can give good code parameters for short length codes when using appropriate combinations of outer-codes and a nested inner-code. In fact, some of them give the best parameters known. Moreover, decoding algorithms may utilize their structure by performing local decoding steps on the outer-codes and utilizing the inner-code layer for exchanging decisions between the outer-codes.
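
The column-wise construction described above can be sketched in a few lines of Python. The toy components below (a length-3 repetition code, a length-3 even-weight code, and the \( (u + v, v) \) mapping as inner-code) are assumptions chosen only for illustration, as are the names `gcc_encode` and `inner`.

```python
from itertools import product


def gcc_encode(outer_codewords, inner_map):
    """Apply the inner mapping to each column of the matrix of outer codewords."""
    n_out = len(outer_codewords[0])
    out = []
    for j in range(n_out):
        column = [row[j] for row in outer_codewords]
        out.extend(inner_map(column))
    return out


# Toy outer-codes over the binary alphabet (illustrative assumptions).
C0 = [(0, 0, 0), (1, 1, 1)]                                      # repetition code
C1 = [c for c in product((0, 1), repeat=3) if sum(c) % 2 == 0]   # even-weight code


def inner(col):
    """Inner mapping phi(a, b) = (a + b, b) over GF(2)."""
    a, b = col
    return [(a + b) % 2, b]


code = [gcc_encode([c0, c1], inner) for c0 in C0 for c1 in C1]
min_weight = min(sum(c) for c in code if any(c))
print(len(code), len(code[0]), min_weight)
```

Here \( N_{out} = 3 \), \( N_{in} = 2 \), \( M_0 = 2 \) and \( M_1 = 4 \), so the resulting code has length 6 and size 8, in agreement with the general formulas.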

As Arikan already noted, polar codes are instances of recursive GCCs [1, Section I.D]. This observation is useful as it allows us to formalize the construction of a large length polar code as a concatenation of several smaller length polar codes (outer-codes) by using a kernel mapping (an inner-code). Therefore, applying this notion to Definition 1 we observe that a polar code of length $\ell^n$ symbols may be regarded as a collection of $\ell$ outer polar codes of length $\ell^{n-1}$ (the $i$th outer-code is $[\gamma_{i,j}]_{j=0}^{N/\ell-1} = g^{(n-1)}\left(u_{i \cdot (N/\ell)}^{(i+1) \cdot (N/\ell)-1}\right)$ for $i \in [\ell]_-$). These codes are then joined together by employing an inner-code (defined by the kernel [23]).
Figure 1: A GCC representation of a polar code of length \(\ell^n\) symbols constructed by a homogenous kernel according to Definition 1.

The above GCC formalization is illustrated in Figure 1. In this figure, we see the \(\ell\) outer-code codewords of length \(\ell^{n-1}\) depicted as gray horizontal rectangles (similar to rows of a matrix). The instances of the inner-codeword mapping are depicted as vertical rectangles that are located on top of the gray outer-codes rows (resembling columns of a matrix). This is appropriate, as this mapping operates on columns of the matrix whose rows are the outer-code codewords. Note that for brevity we only drew three instances of the inner mapping, but there should be \(\ell^{n-1}\) instances of it, one for each column of this matrix. In the homogenous case, the outer-codes themselves are constructed in the same manner. Note, however, that even though these outer-codes have the same structure, they form different codes in the general case. The reason is that they may have different sets of frozen symbols.

Example 1 (Arikan’s Construction) Let \(g(u_0, u_1) = [u_0 \ u_1] \cdot G_2\). Let \(u\) be an \(N = 2^n\) length binary vector. The vector \(u\) is transformed into an \(N\) length vector \(x\) by using a bijective mapping \(g^{(n)}(\cdot): \{0, 1\}^N \rightarrow \{0, 1\}^N\). The transformation is defined recursively as

\[
\text{for } n = 1 \quad g^{(1)}(u) = g(u) = [u_0 + u_1, u_1], \\
\text{for } n > 1 \quad g^{(n)}(u) = x_0^{N-1},
\]

where \([x_{2j}, x_{2j+1}] = [\gamma_{0,j} + \gamma_{1,j}, \ \gamma_{1,j}]\) for \(j \in [N/2]_-\), and \([\gamma_{0,j}]_{j=0}^{N/2-1} = g^{(n-1)}(u_0^{N/2-1}), [\gamma_{1,j}]_{j=0}^{N/2-1} = g^{(n-1)}(u_{N/2}^{N-1})\) are the two outer-codes (each one of length \(N/2\) bits). Figure 2 depicts the GCC block diagram for this example.

The GCC structure of polar codes can be also represented by a layered Forney’s normal factor graph. Layer #0 of this graph contains the inner mappings (represented as sets of vertices), and therefore we refer to it as the inner-layer. Layer #1 contains the vertices of the inner layers of all the outer-codes that are concatenated by layer #0. We may continue and generate layer #i by considering the outer-codes that are concatenated by layer #(i − 1) and include in this layer all the vertices describing their inner mappings. This recursive construction process may continue until we reach outer-codes that cannot be decomposed into non-trivial inner-codes and outer-codes. Edges (representing variables) connect the outputs of the outer-codes to the inputs of the inner mappings. This representation can be viewed as observing the GCC structure in Figure 1 from its side.

In a layered graph, the vertex set can be partitioned into a sequence of subsets called layers and denoted by \(L_0, L_1, \cdots, L_{k-1}\). The edges of the graph connect only vertices within the same layer or in successive layers.

Example 2 (Layered Normal Factor Graph for Arikan’s Construction) Figures 3 and 4 depict a layered factor graph representation for a length $N = 2^n$ symbols polar code with a kernel of $\ell = 2$ dimensions. Figure 3 gives only a block structure of the graph, in which we have the two outer-codes of length $N/2$ that are connected by the inner layer (note the similarities to the GCC block diagram in Figure 2). Half edges represent the inputs $u_0^{N-1}$ and the outputs $x_0^{N-1}$ of the transformation. The edges (denoted by $\gamma_{i,j}$, $j \in [N/2]_-$, $i \in [2]_-$) connect the outputs of the two outer-codes to the inputs of the inner mapping blocks, $g(\cdot)$. A more elaborate version of this figure is given in Figure 4, in which we unfolded the recursive construction.

Strictly speaking, the green blocks that represent the $g(\cdot)$ inner-mapping are themselves factor graphs (i.e. collections of vertices and edges). An example of a normal factor graph specifying such a block is given in Figure 5 for Arikan’s $(u + v, v)$ construction (see Example 1). Vertex $a_0$ represents a parity constraint and vertex $e_1$ represents an equivalence constraint. The half edges $u_0, u_1$ represent the inputs of the mapping, and the half edges $x_0, x_1$ represent its outputs. This graphical structure is probably the most popular visual representation of polar codes (see e.g. [1, Figure 12] and [26, Figure 5.2]) and is also known as the "butterflies" graph because of the arrangement of the edges in Figure 4.

2.2 Mixed-Kernels Polar Codes

Thus far, we described homogenous kernels constructions in which a single kernel and code alphabet is used for generating the polar codes structures. It may be advantageous in terms of error-correction performance and complexity to combine two types of kernels (each one over different alphabet) into one structure. Such constructions are called mixed-kernels structures [8, 27]. In order to have a more comprehensive introduction to the notion of mixed-kernels we give an example of the structure (taken from [8, 27]).
Figure 3: Representation of a polar code with kernel of $\ell = 2$ dimensions as a layered factor graph.

Figure 4: Representation of a polar code with kernel of $\ell = 2$ dimensions as a layered factor graph (detailed version of Figure 3 - recursion unfolded).
Example 3 (Mixed-Kernels Construction) Let $g(\cdot)$ be a four dimensions binary mapping defined as $g(u) = u \cdot G_2^{\otimes 2}$. Using $g(\cdot)$ we define an additional kernel

$$g_0(u_0, u_{(1,2)}, u_3) \triangleq g(u_0, u_1, u_2, u_3),$$ (2)

where $u_{(1,2)} \triangleq [u_1, u_2] \in \{0,1\}^2$.

In other words we take the $u_1$ and $u_2$ binary inputs to $g(\cdot)$ and combine them into a single quaternary entity $u_{(1,2)}$. We informally say that $u_1$ and $u_2$ were glued together generating $u_{(1,2)}$.
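
The gluing operation can be made concrete with a short Python sketch. The matrix `G4` below is $G_2^{\otimes 2}$ written out explicitly, and representing the quaternary symbol $u_{(1,2)}$ as a Python pair of bits is an assumption made for illustration.

```python
# G_2 Kronecker-squared over GF(2); row i is the image of the i-th unit vector.
G4 = [
    [1, 0, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 1, 1, 1],
]


def g(u):
    """Binary kernel g(u) = u * G4 over GF(2)."""
    return [sum(u[i] * G4[i][j] for i in range(4)) % 2 for j in range(4)]


def g0(u0, u12, u3):
    """Mixed-alphabet view of g: the middle input is a glued pair of bits."""
    u1, u2 = u12
    return g([u0, u1, u2, u3])


print(g0(1, (0, 1), 1))
```

The mapping itself is unchanged by the gluing; only the way its inputs are grouped into symbols differs, which is what later forces outer-code #1 to operate over the quaternary alphabet.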

Let $g_1(\cdot): \left(\{0,1\}^2\right)^4 \rightarrow \left(\{0,1\}^2\right)^4$ be a polarizing kernel over the quaternary alphabet. For example, $g_1(\cdot)$ can be a kernel based on the extended Reed-Solomon code of length four, $G_{RS}(4)$, that was proven by Mori and Tanaka [28, Example 20] to be a polarizing kernel. The homogenous polar code generated by $g_1(\cdot)$ is dubbed the RS4 polar code. Using $g_1(\cdot)$, we can extend the mapping of $g_0(\cdot)$ to a length $N = 4^n$ bits code. Both $g_0(\cdot)$ and $g_1(\cdot)$ are referred to as the constituent kernels of the construction. Note that $g_1(\cdot)$ is introduced in order to handle the glued bits $u_{(1,2)}$ of the input of $g_0(\cdot)$ and therefore is also referred to as the auxiliary kernel of the construction. The standard Arikan’s construction (based on the Kronecker power) does not suffice here, because the glued bits $u_{(1,2)}$ need to be jointly treated as a quaternary symbol.

The mixed-kernels construction can be readily explained in terms of the GCC structure. Let $g^{(1)}(\cdot) = g_0(\cdot)$. In order to extend this construction to a mapping $g^{(n)}(u_0^{4^n-1})$, $n > 1$, for which some of the inputs are glued, we suggest the following recursive GCC construction. We define three outer-codes:

**outer-code #0:**

$$[\gamma_{0,j}]_{j=0}^{N/4-1} = g^{(n-1)}\left(u_0^{N/4-1}\right), \quad u_j, \gamma_{0,j} \in \{0,1\}, \quad j \in [N/4]_-;$$

**outer-code #1:**

$$[\gamma_{1,j}]_{j=0}^{N/4-1} = g_1^{(n-1)}\left(\left[u_{(N/4+2j,\,N/4+2j+1)}\right]_{j=0}^{N/4-1}\right), \quad u_{(N/4+2j,\,N/4+2j+1)}, \gamma_{1,j} \in \{0,1\}^2, \quad j \in [N/4]_-;$$

**outer-code #2:**

$$[\gamma_{2,j}]_{j=0}^{N/4-1} = g^{(n-1)}\left(u_{3N/4}^{N-1}\right), \quad u_{j+3N/4}, \gamma_{2,j} \in \{0,1\}, \quad j \in [N/4]_-,$$

where $u_{(i,j)}$ means that the items of sub-vector $u_i^j$ were glued together, generating an element from the larger alphabet $\{0,1\}^{j-i+1}$. Note that outer-codes #0 and #2 are just mixed-kernels constructions of length $N/4$ bits. The outputs of these outer-codes are binary vectors, but the input is a mixture of binary and quaternary symbols (generated by bits that were glued together). Outer-code #1 is a homogenous polar code construction of length $N/4$ quaternary symbols, in which all of the input symbols and output symbols are bits that were glued together in pairs. Finally, these three outer-codes are combined together using the $g_0(\cdot)$ inner mapping:

$$g^{(n)} = \left[ g_0(\gamma_{0,0}, \gamma_{1,0}, \gamma_{2,0}), \ g_0(\gamma_{0,1}, \gamma_{1,1}, \gamma_{2,1}), \ldots, \ g_0(\gamma_{0,N/4-1}, \gamma_{1,N/4-1}, \gamma_{2,N/4-1}) \right].$$

Figure 6 depicts this GCC construction. Note that outer-code #1 was drawn as a rectangle having the same width as outer-code #0 (or #2). This property symbolizes that all the outer-codes have the same length in terms of symbols. On the other hand, the height of the rectangle of outer-code #1 is twice the height of each of the rectangles of the other two outer-codes. This property indicates that the symbols alphabet size of outer-code #1 is twice the size of the symbols alphabet of the other outer-codes (for which the symbols are bits). This is because outer-code #1 is a quaternary mapping in which both the input symbols and the output symbols are pairs of glued bits.

Figure 6: A GCC representation of the length $N = 4^n$ bits mixed-kernels polar code $g^{(n)}(\cdot)$ described in Example 3.

The recursive GCC structure of polar codes enables recursive formalizations of the algorithms associated with them. These algorithms benefit from simple and clear descriptions, which support elegant analysis. Furthermore, in some cases it allows reuse of resources and indicates which operations may be done in parallel. The essence of the recursive encoding algorithm has already been described in Definition 1. In Section 3 we formalize these ideas and in addition describe an algorithm for systematic encoding of polar codes. Afterwards, we consider the decoding algorithms of polar codes, giving them a recursive formulation in Section 4.

3 Recursive Descriptions of Polar Codes Encoding Algorithms

In this section we discuss encoding algorithms for polar codes. We begin in Subsection 3.1 by describing a non-systematic encoding algorithm that is a direct consequence of the GCC structure discussed in Subsection 2.1. Subsection 3.2 considers systematic encoding algorithm of linear polar codes with lower triangular kernel generating matrix.

3.1 Non-Systematic Encoding

In this subsection we consider a non-systematic recursive encoding algorithm that is based on the recursive GCC structure of polar codes. Let us begin by describing the algorithm for Arikan’s \((u + v, v)\) polar code. Let \(u\) be an \(N\) length binary vector, serving as the encoder input. Each polar code of length \(N\) is defined by its \(N\) length frozen-indicator vector \(z\), such that \(z_i = 1\) if and only if the \(i^{th}\) input of the encoder is frozen (i.e. fixed and known to both the encoder and the decoder) and \(z_i = 0\) otherwise. For a \((u + v, v)\) polar code of dimension \(k\), we have that \(w_H(z) = N - k\), where \(w_H(\cdot)\) is the Hamming weight of the vector. Given an information vector \(\tilde{u} \in \{0,1\}^k\), it is the role of the encoder to output its corresponding binary codeword \(x \in \{0,1\}^N\). Given \(\tilde{u}\) and \(z\), it is easy to generate the encoder input \(u \in \{0,1\}^N\), such that the values of \(\tilde{u}\) are sequentially assigned to the non-frozen components of \( u \) and the elements corresponding to frozen indices are set to a predetermined value (here, we arbitrarily decided to set the frozen values of \( u \) to zero), i.e.

\[
\text{if } z_i = 1 \text{ then } u_i = 0, \quad \forall i \in [N]_-;
\]

\[
u_{\theta_i} = \tilde{u}_i, \quad \forall i \in [k]_-,
\]

where \( \theta \) is a \( k \) length vector, such that \( \theta_i \) is the \( i \)th index of \( z \) corresponding to zero value (i.e. indicating a non-frozen input symbol). The signature of the non-systematic encoding algorithm for \( N = 2^n \) code is

\[
x = \text{NonSysEncoder}(u).
\] (4)

Algorithm 1 describes a recursive implementation of the encoder. Note that for a scalar input \( u \) to (4) (i.e. \( n = 0 \)) we have the output \( x \) equal to \( u \).
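
The assembly of the encoder input \( u \) from the information vector \( \tilde{u} \) and the frozen-indicator vector \( z \) can be sketched as follows. The function name `build_encoder_input` and the frozen pattern in the usage line are illustrative assumptions; the pattern is not a designed polar code.

```python
def build_encoder_input(info, frozen):
    """Place the symbols of `info` on the non-frozen positions; frozen ones get 0.

    `frozen` plays the role of z: frozen[i] == 1 marks a frozen input.
    """
    # theta[i] is the i-th index of z holding a zero (a non-frozen position).
    theta = [i for i, z_i in enumerate(frozen) if z_i == 0]
    if len(info) != len(theta):
        raise ValueError("info length must equal the number of non-frozen positions")
    u = [0] * len(frozen)
    for i, pos in enumerate(theta):
        u[pos] = info[i]
    return u


z = [1, 1, 1, 0, 1, 0, 0, 0]  # illustrative pattern with N = 8 and k = 4
print(build_encoder_input([1, 0, 1, 1], z))
```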

Algorithm 1 Non-Systematic Encoder for \((u + v, v)\) Polar Code, of Length \( N = 2^n \) Bits, \( n \geq 1 \)

\begin{itemize}
  \item \textbf{Input:} \( u \).
  \item \textbf{Initialization:}
    \begin{itemize}
      \item Allocate two binary vectors \( x^{(0)} \) and \( x^{(1)} \) each one of length \( N/2 \).
    \end{itemize}
  \item \textbf{Encode the Outer-Codes:}
    \begin{itemize}
      \item Encode the two outer-codes of length \( N/2 \) using the information sub-vectors \( u_0^{N/2-1} \) and \( u_{N/2}^{N-1} \):
        \[
x^{(i)} = \text{NonSysEncoder}\left(u_{i \cdot N/2}^{(i+1) \cdot N/2 - 1}\right), \quad \forall i \in \{0,1\}.
        \]
    \end{itemize}
  \item \textbf{Encode the Inner-Code:}
    \begin{itemize}
      \item Apply the inner-code \((u + v, v)\) on the pairs \( [x^{(0)}_j, x^{(1)}_j] \):
        \[
x_{2j}^{2j+1} = [x^{(0)}_j + x^{(1)}_j, x^{(1)}_j], \quad \forall j \in [N/2]_-.
        \]
    \end{itemize}
\end{itemize}

\begin{itemize}
  \item \textbf{Output:} \( x \).
\end{itemize}
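
A minimal Python sketch of Algorithm 1 is given below, together with a matrix-based reference that multiplies the bit-reversed input by the Kronecker power of \( G_2 \). The helper names (`non_sys_encoder`, `kron_power`, `bit_reverse`, `encode_by_matrix`) are illustrative, and the choice of bit reversal as the permutation is stated here as an assumption that the code checks numerically for small lengths.

```python
def non_sys_encoder(u):
    """Recursive (u + v, v) encoder; len(u) must be a power of two."""
    n = len(u)
    if n == 1:
        return list(u)                  # scalar input: output equals input
    half = n // 2
    x0 = non_sys_encoder(u[:half])      # outer-code #0
    x1 = non_sys_encoder(u[half:])      # outer-code #1
    x = []
    for a, b in zip(x0, x1):            # inner code applied to each pair
        x.extend([a ^ b, b])
    return x


def kron_power(n):
    """n-th Kronecker power of G_2 = [[1, 0], [1, 1]] as a list of rows."""
    g = [[1]]
    for _ in range(n):
        g = [row + [0] * len(row) for row in g] + [row + row for row in g]
    return g


def bit_reverse(i, n):
    """Reverse the n-bit binary representation of i."""
    r = 0
    for _ in range(n):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def encode_by_matrix(u):
    """Reference encoder: bit-reversed u times the Kronecker power, over GF(2)."""
    size = len(u)
    n = size.bit_length() - 1
    g = kron_power(n)
    v = [u[bit_reverse(i, n)] for i in range(size)]
    return [sum(v[i] * g[i][j] for i in range(size)) % 2 for j in range(size)]


u = [1, 0, 1, 1]
print(non_sys_encoder(u), encode_by_matrix(u))
```

The recursive version uses \( N/2 \) additions per recursion level, which is the source of the \( O(N \cdot \log(N)) \) encoding complexity, whereas the matrix reference is quadratic and serves only as a check.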

Let us now consider a general kernel \( g(\cdot) \) of \( \ell \) dimensions over a field \( F \), i.e. \( g : F^\ell \to F^\ell \). The signature of the encoder remains the same, only that both \( u \) and \( x \) are in \( F^N \). Algorithm 2 describes the encoding procedure for this case. Similarly to the \((u + v, v)\) case, the function has its output equal to its input for scalar inputs.

Encoding of mixed-kernels is performed in a similar fashion. The difference is that in the outer-code encoding phase we need to provide information sub-vectors of different lengths. Let us consider the mixed-kernels instance given in Example 3. We have three computations of outer-codes

\[
x^{(0)} = \text{NonSysEncoder}(u_0^{N/4-1}); \quad x^{(1)} = \text{NonSysEncoder}^{(RS4)}(u_{N/4}^{3N/4-1}); \quad x^{(2)} = \text{NonSysEncoder}(u_{3N/4}^{N-1}),
\] (9)

where \( x^{(0)}, x^{(2)} \in \{0,1\}^{N/4} \) and \( x^{(1)} \in \{0,1\}^{N/2} \). The function \( \text{NonSysEncoder}^{(RS4)} \) is the encoding procedure of the homogenous \( RS4 \) code, whose input and output are \( GF(4) \) vectors. The elements of \( GF(4) \) are represented by their binary vector (cartesian) form.

Algorithm 2 Non-Systematic Encoder for Homogenous Polar Code of Length \( N = \ell^n \) \( F \)-Symbols, Based on Kernel \( g(\cdot) \), \( n \geq 1 \)

Input: \( u \).

//Initialization:
▷ Allocate \( \ell \) vectors \( \{ x^{(i)} \}_{i=0}^{\ell-1} \), each one of length \( N/\ell \) \( F \)-symbols.

//Encode the Outer-Codes:
▷ Encode the \( \ell \) outer-codes of length \( N/\ell \) using the information sub-vectors \( \left\{ u_{i \cdot N/\ell}^{(i+1) \cdot N/\ell-1} \right\}_{i \in [\ell]_-} \):
\[
x^{(i)} = \text{NonSysEncoder} \left( u_{i \cdot N/\ell}^{(i+1) \cdot N/\ell-1} \right), \quad \forall i \in [\ell]_-.
\] (7)

//Encode the Inner-Code:
▷ Apply the inner-code \( g(\cdot) \) on the sub-vectors \( \left[ x_j^{(0)}, x_j^{(1)}, \ldots, x_j^{(\ell-1)} \right], \forall j \in [N/\ell]_- \):
\[
x_{j \cdot \ell}^{(j+1) \cdot \ell-1} = g \left( x_j^{(0)}, x_j^{(1)}, \ldots, x_j^{(\ell-1)} \right), \quad \forall j \in [N/\ell]_-.
\] (8)

Output: \( x \).
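
Algorithm 2 translates almost line by line into Python. In the sketch below the kernel is passed as a function on lists of \( \ell \) symbols; the 3-dimensional binary matrix `G3` is an invertible example chosen only to exercise the recursion and is not taken from this paper.

```python
from itertools import product


def non_sys_encoder(u, kernel, ell):
    """Recursive encoder for a homogeneous kernel of `ell` dimensions.

    `kernel` maps a list of `ell` symbols to a list of `ell` symbols.
    len(u) must be a power of `ell`.
    """
    n = len(u)
    if n == 1:
        return list(u)                  # scalar input: output equals input
    sub = n // ell
    # Encode the ell outer-codes on consecutive sub-vectors of u.
    outer = [non_sys_encoder(u[i * sub:(i + 1) * sub], kernel, ell)
             for i in range(ell)]
    # Apply the inner mapping on each column [x_j^(0), ..., x_j^(ell-1)].
    x = []
    for j in range(sub):
        x.extend(kernel([outer[i][j] for i in range(ell)]))
    return x


# Illustrative invertible binary kernel of 3 dimensions (an assumption).
G3 = [[1, 0, 0],
      [1, 1, 0],
      [1, 0, 1]]


def kernel3(v):
    """Linear kernel v -> v * G3 over GF(2)."""
    return [sum(v[i] * G3[i][j] for i in range(3)) % 2 for j in range(3)]


codewords = {tuple(non_sys_encoder(list(u), kernel3, 3))
             for u in product((0, 1), repeat=9)}
print(len(codewords))
```

Because the kernel is bijective, the induced length-9 mapping is bijective as well, so the set of outputs has as many elements as there are inputs.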

3.2 Systematic Encoding

In this subsection we consider systematic encoding of polar codes. A systematic encoder has the property that the non-frozen symbols of the encoder input vector, \( u \), appear explicitly in their corresponding codeword, \( x \). Formally speaking, for a length \( N \) code, we define a bijective mapping function \( m_N(\cdot) : [N]_- \rightarrow [N]_- \), such that a systematic encoder corresponding to \( m_N(\cdot) \) outputs \( x \), satisfying \( u_t = x_{m_N(t)} \) for all non-frozen indices \( t \in [N]_- \) (i.e. \( z_t = 0 \)). A systematic encoder is advantageous because it facilitates retrieval of the user information without performing decoding first (assuming no errors occurred in the received codeword). Furthermore, Arikan demonstrated by simulations that systematic coding systems have better BER performance than non-systematic coding systems using the same \( (u + v, v) \) polar code [29].

In this paper we consider systematic encoders for linear kernels having a lower triangular generating matrix \( G \in F^{\ell \times \ell} \). The signature of the systematic encoder is defined as follows:
\[
[x, \, \tilde{u}] = \text{SysEncoder} (u, \, z),
\] (10)
where the vectors \( x \), \( u \) and \( z \) were defined before in Subsection 3.1 and \( x \) is a systematic encoding of \( u \). The vector \( \tilde{u} \in F^N \) is the input for the non-systematic encoder that results in \( x \), i.e. \( x = \text{NonSysEncoder} (\tilde{u}) \) and \( \tilde{u}_t = 0 \) if \( z_t = 1 \) for all \( t \in [N]_- \). While not being a necessary output of the algorithm, \( \tilde{u} \) is used here to enable a more comprehensible description of the systematic encoder. Indeed, the systematic encoder may be understood as an algorithm for finding the vector \( \tilde{u} \) meeting these requirements.

Let us first consider the \( N = \ell \) \( F \)-symbols case. In this case we have to find \( \tilde{u} \) such that \( \tilde{u} \cdot G = x \), and
\[
\forall t \in [\ell]_-, \quad u_t = x_{m_\ell(t)} \text{ if } z_t = 0 \text{ and otherwise } \tilde{u}_t = 0.
\] (11)
For this base case we take \( m_\ell(\cdot) \) to be the identity function, i.e. \( m_\ell(t) = t, \forall t \in [\ell]_- \). Algorithm 3 describes the systematic encoding procedure for this case. It can be easily shown by induction on the for-loop variable \( j \) (beginning with \( \ell - 1 \) and ending with 0) that on each step condition (11) is met.
Algorithm 3 Systematic Encoder for Homogenous Length $N = \ell$ $F$-Symbols Polar Code, Based on Lower Triangular Kernel $G \in F^{\ell \times \ell}$

Input: $u; z$.
//Initialization:
▷ Allocate two vectors $\tilde{u}, x \in F^\ell$. Initialize $x = 0$.
//Successively encode $u$:
▷ For $j = \ell - 1$ to 0 Do
  • If $z_j == 0$ Then set $\tilde{u}_j = G_{j,j}^{-1} \cdot (u_j - x_j)$; Else set $\tilde{u}_j = 0$;
  • Set $x = x + \tilde{u}_j \cdot G_{j\rightarrow}$;
Output: $x; \tilde{u}$.
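
The base-case procedure of Algorithm 3 can be sketched in Python under the simplifying assumption that \( F = GF(2) \), so that the diagonal entries of \( G \) equal 1 and subtraction coincides with XOR. The function name `sys_encoder_base` and the use of $G_2^{\otimes 2}$ as a 4-dimensional lower triangular kernel are illustrative choices.

```python
def sys_encoder_base(u, z, G):
    """Systematic encoding for N = ell over GF(2) with lower triangular G.

    Returns (x, u_tilde) with x = u_tilde * G, x[t] == u[t] on non-frozen t,
    and u_tilde[t] == 0 on frozen t.
    """
    ell = len(G)
    x = [0] * ell
    u_tilde = [0] * ell
    for j in range(ell - 1, -1, -1):
        if z[j] == 0:
            # G[j][j] == 1 over GF(2), so its inverse is 1 and minus is XOR.
            u_tilde[j] = u[j] ^ x[j]
        else:
            u_tilde[j] = 0
        if u_tilde[j]:
            # x = x + u_tilde[j] * (row j of G)
            x = [x[c] ^ G[j][c] for c in range(ell)]
    return x, u_tilde


G4 = [[1, 0, 0, 0],
      [1, 1, 0, 0],
      [1, 0, 1, 0],
      [1, 1, 1, 1]]

x, u_tilde = sys_encoder_base([1, 0, 1, 1], [1, 1, 0, 0], G4)
print(x, u_tilde)
```

Processing the indices from \( \ell - 1 \) down to 0 matters: since row \( j \) of a lower triangular \( G \) is zero in columns above \( j \), later rounds cannot disturb the positions that were already fixed.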

For the general $N = \ell^n$ case where $n > 1$ we utilize the GCC structure of the polar code in order to perform systematic encoding. Let us first describe the indices mapping function $m_{N}(\cdot)$. As was already noted in the GCC discussion, and was exemplified in the non-systematic encoder (Algorithm 2), the input sub-vector $u_{i \cdot N/\ell}^{(i+1) \cdot N/\ell-1}$ is also the input of outer-code $C_i$ for all $i \in [\ell]_-$. The following requirement of the mapping function will prove useful in the recursive implementation.

$$m_N(t) \equiv \left\lfloor \frac{t}{N/\ell} \right\rfloor \pmod{\ell}, \quad \forall t \in [N]_-, \ \forall N = \ell^n. \ \ (12)$$

The implication of (12) is that non-frozen symbols placed at index $t$, such that $b \cdot N/\ell \leq t < (b+1) \cdot N/\ell$, where $b \in [\ell]_-$, should appear at the output $x_\tau$ where $\tau = a \cdot \ell + b$ and $a$ is some number in $[N/\ell]_-$. Note that index $t$ of the input corresponds to the inputs of outer-code $C_b$. Furthermore, if $x^{(i)}$ is the outer-code codeword of $C_i$ we have $x_\tau = \sum_{i=b}^{\ell-1} G_{i,b} \cdot x^{(i)}_a$ (see Figure 1). This connection is useful, because if we already systematically encoded all the inputs $t'$ such that $(b+1) \cdot N/\ell \leq t'$ (corresponding to outer-codes $C_{b'}$ where $b' \geq b+1$), by appropriately calling the systematic encoder of $C_b$ we can ensure that indeed $x_\tau = u_t$.

It can be proven by induction that a mapping function implementing the following recursion formula indeed satisfies (12):

$$m_{\ell^n}(t) = \ell \cdot m_{\ell^{n-1}}(R_{\ell^{n-1}}[t]) + \left\lfloor \frac{t}{\ell^{n-1}} \right\rfloor, \ \forall t \in [\ell^n]_-; \ \ \ (13)$$

where $R_\beta[\alpha]$ is the remainder of $\alpha$ divided by $\beta$. Note that according to this definition, $m_N(\cdot)$ is a base $\ell$ digit-reversal function, i.e. the base $\ell$ representation of $m_N(t)$ equals the $n$-digit base $\ell$ representation of $t$ given in reverse order (for $\ell = 2$ this transformation is also known as the bit-reversal or reverse shuffle operation).
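The recursion (13) can be checked numerically. The following Python sketch is only an illustration: the function names `index_map` and `digit_reverse` are hypothetical, and the convention $m_1(0) = 0$ for the bottom of the recursion is an assumption. It evaluates (13) and compares the result with an explicit base $\ell$ digit reversal and with requirement (12).

```python
def index_map(t, N, ell):
    """Recursion (13): m_N(t) = ell * m_{N/ell}(t mod N/ell) + floor(t / (N/ell))."""
    if N == 1:              # assumed convention for the bottom of the recursion: m_1(0) = 0
        return 0
    sub = N // ell          # sub = ell^(n-1)
    return ell * index_map(t % sub, sub, ell) + t // sub


def digit_reverse(t, n, ell):
    """Reverse the n-digit base-ell representation of t."""
    out = 0
    for _ in range(n):
        out = out * ell + t % ell   # append the current least significant digit
        t //= ell
    return out


ell, n = 3, 3
N = ell ** n
mapping = [index_map(t, N, ell) for t in range(N)]
```

For example, with $\ell = 3$ and $N = 27$, the index $t = 5$ (digits 012) is mapped to 21 (digits 210).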

Algorithm 4 describes the recursive algorithm for the general $N = \ell^n$ case (for $n > 1$). The algorithm can also be easily adapted for mixed-kernels. Let us prove that the algorithm meets the systematic encoding requirement.

Observation 1 After round $i$ of the for-loop (beginning with $\ell - 1$ and ending with 0), codeword components $x_\tau$ such that $R_\ell[\tau] \geq i$ are not changed anymore by Algorithm 4.

Proof Equation (16) updates vector $x$ at the end of round $i$. Since $G$ is lower triangular, we have $G_{i',j'} = 0$ for all $i' \in [\ell]_-$ and $j' > i'$. Therefore all the updates for rounds $i' < i$ of the for-loop will have zeros in the vector $G_{i'\rightarrow}$ in places corresponding to $x_\tau$ where $R_\ell[\tau] \geq i$. \hfill $\Diamond$

Observation 2 After round \( i \) of the for-loop (beginning with \( \ell - 1 \) and ending with 0), we have \( x_{m_N(t)} = u_t \) for all non-frozen components \( u_t \) such that \( R_\ell [m_N (t)] = i \), where \( i \in [\ell]_- \) and \( t \in [N]_- \).

Proof Let \( t \) be such that \( R_\ell [m_N (t)] = i \). Following (12) we have that \( \frac{N}{\ell} \cdot i \leq t \leq \frac{N}{\ell} \cdot (i + 1) - 1 \). Consequently, \( u_t \) is encoded at round \( i \) of the for-loop, dedicated for encoding \( C_i \). Assume that \( t = \frac{N}{\ell} \cdot i + r \) where \( r = R_{N/\ell}[t] \). In (14) we have \( \bar{u}'_r = G_{i,i}^{-1} \cdot (u_t - x_{m_N(t)}) \). After applying the systematic encoder in (15), we have that \( \bar{x}_{m_{N/\ell}(r)} = \bar{u}'_r \). Following the execution of (16), we have \( x_{m_N(t)} = x_{m_N(t)} + \bar{x}_{\lfloor m_N(t)/\ell \rfloor} \cdot G_{i,i} \).

However, due to (13), we have \( \lfloor m_N(t) / \ell \rfloor = m_{N/\ell} (R_{N/\ell}[t]) = m_{N/\ell} (r) \). Therefore, we have \( x_{m_N(t)} = x_{m_N(t)} + \bar{u}'_r \cdot G_{i,i} = u_t \). Owing to Observation 1 the value of \( x_{m_N(t)} \) will not further change in the algorithm, which proves the statement. \( \Box \)

Algorithm 4 Systematic Encoder for Homogenous Length \( N = \ell^n \) F-Symbols Polar Code, Based on Lower Triangular Kernel \( G \in F^{\ell \times \ell} \)

Input: \( u; z \).

//Initializations:
\( \triangleright \) Allocate \( \ell \) vectors \( \{\bar{u}^{(i)}\}_{i \in [\ell]_-} \) of length \( \frac{N}{\ell} \) F-symbols.
\( \triangleright \) Allocate two vectors \( \bar{u}', \bar{x} \in F^{N/\ell} \).
\( \triangleright \) Allocate two vectors \( \hat{u}, x \in F^N \). Initialize \( x = 0 \).

//Successively encode \( u \):
\( \triangleright \) For \( i = \ell - 1 \) to 0 Do //encode \( C_i \):
- Prepare vector \( \bar{u}' \) which serves as the modified input to the encoder of \( C_i \):
  \[
  \bar{u}'_r = \begin{cases}
  G_{i,i}^{-1} \cdot \left( u_{i \cdot \frac{N}{\ell} + r} - x_{m_N(i \cdot \frac{N}{\ell} + r)} \right), & z_{i \cdot \frac{N}{\ell} + r} = 0; \\
  0, & \text{otherwise,}
  \end{cases}
  \quad \forall r \in \left[\frac{N}{\ell}\right]_- ;
  \tag{14}
  \]
- Run \( C_i \) systematic encoder:
  \[
  \left[ \bar{x}, \; \bar{u}^{(i)} \right] = \text{SysEncoder} \left( \bar{u}', \; z_{i \cdot \frac{N}{\ell}}^{(i+1) \cdot \frac{N}{\ell} - 1} \right) ;
  \tag{15}
  \]
- Update the encoded vector \( x \):
  \[
  x_{r' \cdot \ell}^{(r' + 1) \cdot \ell - 1} = x_{r' \cdot \ell}^{(r' + 1) \cdot \ell - 1} + \bar{x}_{r'} \cdot G_{i\rightarrow}, \quad \forall r' \in \left[\frac{N}{\ell}\right]_- ;
  \tag{16}
  \]

Output: \( x ; \; \hat{u} = [\bar{u}^{(0)}, \; \bar{u}^{(1)}, \ldots, \bar{u}^{(\ell - 1)}] \).
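To make the recursion concrete, the following Python sketch follows the three steps (14)-(16), with the length-$\ell$ encoder serving as the base case. It is a minimal illustration under stated assumptions, not a reference implementation: the alphabet is restricted to $F = GF(2)$, the kernel is lower triangular with a unit diagonal (so $G_{i,i}^{-1} = 1$), and the names `index_map` and `sys_encode` are hypothetical.

```python
import random


def index_map(t, N, ell):
    """Index mapping m_N(t) of recursion (13), with the assumed convention m_1(0) = 0."""
    if N == 1:
        return 0
    sub = N // ell
    return ell * index_map(t % sub, sub, ell) + t // sub


def sys_encode(u, z, G):
    """Systematic encoder sketch over GF(2); returns (x, u_hat).

    G is assumed lower triangular with ones on the diagonal; z[t] == 1 marks a frozen symbol.
    """
    ell, N = len(G), len(u)
    x = [0] * N
    if N == ell:                       # base case: single kernel application
        u_hat = [0] * ell
        for j in reversed(range(ell)):
            if z[j] == 0:
                u_hat[j] = (u[j] - x[j]) % 2     # G[j][j] == 1, so no inversion is needed
            x = [(x[k] + u_hat[j] * G[j][k]) % 2 for k in range(ell)]
        return x, u_hat
    sub = N // ell
    u_hat = [0] * N
    for i in reversed(range(ell)):     # encode outer-code C_i
        u_mod = [0] * sub              # modified input, as in (14)
        for r in range(sub):
            t = i * sub + r
            if z[t] == 0:
                u_mod[r] = (u[t] - x[index_map(t, N, ell)]) % 2
        x_bar, u_bar = sys_encode(u_mod, z[i * sub:(i + 1) * sub], G)   # as in (15)
        u_hat[i * sub:(i + 1) * sub] = u_bar
        for rp in range(sub):          # update of x with row i of G, as in (16)
            for k in range(ell):
                x[rp * ell + k] = (x[rp * ell + k] + x_bar[rp] * G[i][k]) % 2
    return x, u_hat


G2 = [[1, 0], [1, 1]]                  # Arikan's kernel
random.seed(0)
N = 8
u = [random.randint(0, 1) for _ in range(N)]
z = [1, 1, 1, 0, 1, 0, 0, 0]           # illustrative frozen pattern
x, u_hat = sys_encode(u, z, G2)
```

After the call, every non-frozen symbol $u_t$ should appear at position $m_N(t)$ of `x`, which is the property stated in Observation 2.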

4 Recursive Descriptions of Polar Codes Decoding Algorithms

In this section we describe decoding algorithms for polar codes in a recursive framework that is induced from their recursive GCC structures. Roughly speaking, all the algorithms we consider here have a similar format. Consider the GCC structure of Figure 1. In this construction we have a length \( N \) symbols code that is composed of \( \ell \) outer-codes, denoted by \( \{C_i\}_{i=0}^{\ell - 1} \), each one of length \( N/\ell \) symbols. The decoding algorithms that are considered here are composed of \( \ell \) pairs of steps. The \( i^{th} \) pair is dedicated to decoding \( \mathcal{C}_i \) as described in Algorithm 5.

Algorithm 5 Decoding Outer-code \( \mathcal{C}_i \), \( i \in [\ell]_- \)

//STEP \( 2 \cdot i \):
\( \triangleright \) Using the previous steps, prepare the inputs to the decoder of outer-code \( \mathcal{C}_i \).

//STEP \( 2 \cdot i + 1 \):
\( \triangleright \) Run the decoder of code \( \mathcal{C}_i \) on the inputs you prepared.
\( \triangleright \) Process the output of this decoder, together with the outputs of the previous steps.

Typically, the codes \( \{\mathcal{C}_i\}_{i=0}^{\ell-1} \) are polar codes of length \( N/\ell \) symbols, thereby creating the recursive structure of the decoding algorithm.

Note that the decoding algorithm structure in Algorithm 5 is quite typical for decoding algorithms of GCCs. As an example, see the decoding algorithms in Dumer’s survey on GCCs [24]. In addition, the recursive decoding algorithms for Reed-Muller (RM) codes, utilizing their Plotkin \((u + v, v)\) recursive GCC structure, were extensively studied by Dumer [30, 31] and are closely related to the algorithms we present here. Actually, Dumer’s simplified decoding algorithm for RM codes [31, Section IV] is the SC decoding for Arikan’s structure that we describe in Subsection 4.1.

The algorithms we describe in a recursive fashion are the SC (Subsection 4.1), Tal and Vardy’s SCL (Subsection 4.2) and BP (Subsection 4.3). For all of these algorithms, we first consider Arikan’s \((u + v, v)\) code and then provide generalizations for other kernels, both homogenous and mixed. We note that, when possible, we prefer that the inputs to the algorithm and the internal computations are interpreted as log likelihood ratios (LLRs). Consequently, the SC algorithm and BP are described in such a manner. In SCL, however, we need to be able to decide among different simultaneous decoding options, therefore we use log-likelihoods (LLs) instead of LLRs.

Furthermore, in our discussion we do not consider how to efficiently compute these quantities. In some cases, especially with large kernels or with large alphabet size, these calculations pose a computational challenge. Approaches to address this challenge include efficient decoding algorithms (such as variants of the Viterbi algorithm) or approximations of the computations (for example, the min-sum approximation that Leroux et al. used [15] or the near Maximum Likelihood (ML) decoding algorithms that were used by Trifonov [9]).

**Remark 1 (SCL Decoding with LLRs)** Balatsoukas-Stimming et al. presented an LLR based SCL decoder [19] in which the decoding options are selected based on a measure called the path-metric (PM). PM is a function of the computed LLRs and the already decided information symbols. It can be easily seen that tracking the PM measure can also be integrated into the recursive description given in Subsection 4.2. This can be achieved by introducing an additional data-structure to hold its computations.

### 4.1 A Recursive Description of the SC Algorithm

We begin by considering the SC decoder for Arikan’s \((u + v, v)\) construction. Description of the algorithm for generalized arbitrary kernels then follows. The inputs of the SC algorithm for Arikan’s construction are listed below.

- An \( N \) length vector of input LLRs, \( \lambda \), such that \( \lambda_j = \ln \frac{\Pr(Y_j = y_j | X_j = 0)}{\Pr(Y_j = y_j | X_j = 1)} \) for \( j \in [N]_- \), where \( Y_j \) is the measurement of the \( j^{th} \) channel \( X_j \rightarrow Y_j \).

- Vector indicator \( z \in \{0, 1\}^N \), in which \( z_i = 1 \) if and only if element number \( i \) of the information vector \( u \) is frozen.

The algorithm outputs the following structures.

- An $N$ length binary vector $\hat{u}$ containing the information word that the decoder estimated. This vector includes the frozen symbols placed in their appropriate positions.
- An $N$ length binary vector $\hat{x}$ which is the codeword corresponding to $\hat{u}$.

The SC function signature is defined as

$$ [\hat{u}, \hat{x}] = \text{SCDecoder}(\lambda, z). \quad (17) $$

First, let us describe the decoding algorithm for a length $N = 2$ bits code, i.e. for the basic kernel $g^{(1)}(u, v) = (u + v, v)$. We get as input $\lambda = [\lambda_0, \lambda_1]$, which are the LLRs of the output of the channel ($\lambda_0$ corresponds to the first output of the channel and $\lambda_1$ corresponds to the second output). The procedure has four steps as described in Algorithm 6. Note that steps 1 and 3 may be done based on the LLRs computed in steps 0 and 2, respectively (i.e. by their sign), or by using additional side information (for example, if $u$ is frozen, then the decision is based on its known value). A decoder for a length $N$ polar code is described in Algorithm 7.

Let us now generalize this decoding algorithm for a GCC homogenous scheme with general kernel. In this case for length $N$ $F$-symbols code, we have an $\ell$ length mapping $g(u) = x$ over the $F$ alphabet, i.e. $g(\cdot) : F^\ell \rightarrow F^\ell$. The inputs and the outputs of the decoding algorithm are the same as in the $(u + v, v)$ case, except that here the LLRs may correspond to non-binary alphabet. As a consequence, we need to have $|F| - 1$ LLR input vectors $\{\lambda^{(t)}\}_{t \in F \setminus \{0\}}$ each one of length $N$ and defined such that

$$ \lambda^{(t)}_j = \ln \frac{\Pr(Y_j = y_j | X_j = 0)}{\Pr(Y_j = y_j | X_j = t)} $$

for $j \in [N]_-$, where $Y_j$ is the measurement of the $j^{th}$ channel $X_j \rightarrow Y_j$. Furthermore $u$ and $x$ are in $F^N$. Note that we always have $\lambda^{(0)}_j = 0$ and therefore it doesn’t have to be calculated. The following is the signature for the general SC decoder

$$ [\hat{u}, \hat{x}] = \text{SCDecoder}\left(\left\{\lambda^{(t)}\right\}_{t \in F \setminus \{0\}}, z\right). \quad (21) $$

Algorithm 6 SC of the $(u + v, v)$ Kernel

Input: $\lambda$; $z$.

//STEP 0:
▷ Compute the LLR of $u$: $\hat{\lambda} = 2 \tanh^{-1} \left(\tanh(\lambda_0/2) \tanh(\lambda_1/2)\right)$.

//STEP 1:
▷ Decide on $u$ (denote the decision by $\hat{u}$).

//STEP 2:
▷ Compute the LLR of $v$, given the estimate $\hat{u}$: $\hat{\lambda} = (-1)^{\hat{u}} \cdot \lambda_0 + \lambda_1$.

//STEP 3:
▷ Decide on $v$ (denote the decision by $\hat{v}$).

Output:
- $\hat{\mathbf{u}} = [\hat{u}, \hat{v}]$;
- $\hat{\mathbf{x}} = [\hat{u} + \hat{v}, \hat{v}]$.

Algorithm 7 SC Recursive Description for Length $N = 2^n$ Bits $(u + v, v)$ Polar Code

Input: $\lambda$, $z$.

//STEP 0:
▷ Compute the LLR input vector, $\hat{\lambda}_0^{N/2-1}$, for the first outer-code such that
$$\hat{\lambda}_i = 2 \tanh^{-1} (\tanh(\lambda_{2i}/2) \tanh(\lambda_{2i+1}/2)), \quad \forall i \in [N/2]_-.$$  

//STEP 1:
▷ Give the vector $\hat{\lambda}$ as an input to the polar code decoder of length $N/2$. Also provide to the decoding algorithm, the indices of the frozen bits from the first half of the codeword (corresponding to the first outer-code), i.e. run
$$\left[\hat{\mathbf{u}}^{(0)}, \hat{\mathbf{x}}^{(0)}\right] = \text{SCDecoder} \left(\hat{\lambda}, z_0^{N/2-1}\right).$$ (18)

According to (17), $\hat{\mathbf{u}}^{(0)}$ is the information word estimation for the first outer-code, and $\hat{\mathbf{x}}^{(0)}$ is its corresponding codeword.

//STEP 2:
▷ Using $\lambda$ and $\hat{\mathbf{x}}^{(0)}$, prepare the LLR input vector, $\hat{\lambda}_0^{N/2-1}$, for the second outer-code, such that
$$\hat{\lambda}_i = (-1)^{\hat{\mathbf{x}}^{(0)}_i} \cdot \lambda_{2i} + \lambda_{2i+1}, \quad \forall i \in [N/2]_-.$$  

//STEP 3:
▷ Give the vector $\hat{\lambda}$ as an input to the polar code decoder of length $N/2$. In addition, provide the indices of the frozen bits from the second half of the codeword (corresponding to the second outer-code), i.e. run
$$\left[\hat{\mathbf{u}}^{(1)}, \hat{\mathbf{x}}^{(1)}\right] = \text{SCDecoder} \left(\hat{\lambda}, z_{N/2}^{N-1}\right),$$ (19)

where $\hat{\mathbf{u}}^{(1)}$ and $\hat{\mathbf{x}}^{(1)}$ are the estimations of the information word and its corresponding codeword of the second outer-code.

Output:
- $\hat{\mathbf{u}} = [\hat{\mathbf{u}}^{(0)}, \hat{\mathbf{u}}^{(1)}]$;
- $\hat{\mathbf{x}} = \left[\hat{\mathbf{x}}^{(0)}_i + \hat{\mathbf{x}}^{(1)}_i, \hat{\mathbf{x}}^{(1)}_i\right]_{i=0}^{N/2-1}$.
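Algorithm 7 translates almost line by line into code. The Python sketch below is a minimal illustration and rests on a few assumptions that are not part of the text: the recursion is carried down to a single symbol (which, for $N = 2$, reproduces the four steps of Algorithm 6), frozen bits are taken to be 0, ties are decided as 0, and the helper names `f_llr`, `polar_encode` and `sc_decoder` are hypothetical. The encoder is included only to generate a test codeword with the same outer-code structure.

```python
import math


def f_llr(a, b):
    """STEP 0 combination: 2 * atanh(tanh(a/2) * tanh(b/2))."""
    t = math.tanh(a / 2) * math.tanh(b / 2)
    t = max(min(t, 1 - 1e-15), -1 + 1e-15)    # keep atanh finite for saturated LLRs
    return 2 * math.atanh(t)


def polar_encode(u):
    """Non-systematic encoder with the same outer-code structure as Algorithm 7."""
    N = len(u)
    if N == 1:
        return list(u)
    x0 = polar_encode(u[:N // 2])
    x1 = polar_encode(u[N // 2:])
    x = []
    for a, b in zip(x0, x1):
        x += [a ^ b, b]                        # (u + v, v) applied to each pair
    return x


def sc_decoder(lam, z):
    """Recursive SC decoder sketch; returns (u_hat, x_hat)."""
    N = len(lam)
    if N == 1:
        # assumed base case: frozen bits equal 0, otherwise decide by the LLR sign
        bit = 0 if (z[0] == 1 or lam[0] >= 0) else 1
        return [bit], [bit]
    half = N // 2
    lam0 = [f_llr(lam[2 * i], lam[2 * i + 1]) for i in range(half)]           # STEP 0
    u0, x0 = sc_decoder(lam0, z[:half])                                       # STEP 1
    lam1 = [(-1) ** x0[i] * lam[2 * i] + lam[2 * i + 1] for i in range(half)]  # STEP 2
    u1, x1 = sc_decoder(lam1, z[half:])                                       # STEP 3
    x_hat = []
    for i in range(half):
        x_hat += [x0[i] ^ x1[i], x1[i]]
    return u0 + u1, x_hat


u = [0, 0, 0, 1, 0, 1, 1, 0]
z = [1, 1, 1, 0, 1, 0, 0, 0]                   # illustrative frozen pattern
x = polar_encode(u)
lam = [4.0 if bit == 0 else -4.0 for bit in x]  # noiseless channel LLRs
u_hat, x_hat = sc_decoder(lam, z)
```

On this noiseless input the decoder is expected to return the transmitted information word and codeword.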

In the GCC structure of this polar code there exist at most \( \ell \) outer-codes \( \{C_i\} \), each one of length \( N/\ell \) symbols. We may have fewer than \( \ell \) outer-codes, in case some of the inputs are glued (which results in a mixed-kernels construction). In such cases, the outer-code corresponding to the glued inputs is considered to be over a larger size input alphabet. We assume that each outer-code has a decoding algorithm associated with it. This decoding algorithm is assumed to receive as input the "channel" observations on the outer-code symbols (usually manifested as probability matrices, or LLR vectors). If the outer-code is a polar code, then this algorithm should also receive the indices of the frozen symbols of the outer-code. We require that the algorithm outputs its estimation of the information vector and its corresponding outer-code codeword.

Let us first consider an \( \ell \) length code generated by a single application of the kernel, i.e. \( x = g(u) \). Note that this is the base case of the recursion. Assuming that we already decided on symbols \( u_0^{i-1} \) (denote this decision by \( \hat{u}_0^{i-1} \)), computing the LLR \( \hat{\lambda}^{(t)} \) corresponding to the \( i^{th} \) input of the transformation (i.e. \( u_i \)) is done according to the following rule

\[
\hat{\lambda}^{(t)} = \ln \frac{\sum_{u_{i+1}^{\ell-1} \in F^{\ell-i-1}} R_g \left( \hat{u}_0^{i-1}, 0, u_{i+1}^{\ell-1} \right)}{\sum_{u_{i+1}^{\ell-1} \in F^{\ell-i-1}} R_g \left( \hat{u}_0^{i-1}, t, u_{i+1}^{\ell-1} \right)},
\tag{22}
\]

where

\[
R_g \left( u_0^{\ell-1} \right) = \exp \left( -\sum_{r=0}^{\ell-1} \lambda_r^{(x_r)} \right), \text{ such that } x = g(u).
\]

Consequently, SC decoding for the \( \ell \) length polar code includes sequential calculations of the LLR values \( \{\hat{\lambda}^{(t)}\}_{t \in F \setminus \{0\}} \) corresponding to non-frozen \( u_i \) according to (22), followed by a decision on \( u_i \) (denoted by \( \hat{u}_i \)) for \( i \in [\ell]_- \). If \( u_i \) is frozen, we set \( \hat{u}_i \) to be equal to its predetermined value. Finally, following (21), we output \( \hat{u} = [\hat{u}_0 \ \hat{u}_1 \ldots \hat{u}_{\ell-1}] \) and \( \hat{x} = g(\hat{u}) \).
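Rule (22) can be evaluated by direct enumeration, which is practical only for small kernels and alphabets. The Python sketch below is an illustration under assumptions: the alphabet is modeled as the integers modulo `q`, the input LLRs are stored as `lam[s][r]` (symbol `s`, channel `r`, with `lam[0][r] = 0`), and the names `kernel_llr` and `arikan` are hypothetical.

```python
import itertools
import math


def kernel_llr(lam, g, u_prefix, t, q=2):
    """Brute-force evaluation of (22) for kernel input number i = len(u_prefix).

    lam[s][r] is the LLR of channel r for symbol s (lam[0][r] must be 0),
    g maps a tuple of ell symbols to a tuple of ell symbols.
    """
    ell = len(lam[0])
    i = len(u_prefix)

    def R(u):
        x = g(u)
        return math.exp(-sum(lam[x[r]][r] for r in range(ell)))

    def total(symbol):
        # enumerate all assignments to the undecided inputs u_{i+1}, ..., u_{ell-1}
        tails = itertools.product(range(q), repeat=ell - i - 1)
        return sum(R(tuple(u_prefix) + (symbol,) + tail) for tail in tails)

    return math.log(total(0) / total(t))


def arikan(u):
    """The binary (u + v, v) kernel."""
    return ((u[0] + u[1]) % 2, u[1])


lam = [[0.0, 0.0], [0.7, -1.3]]                 # illustrative channel LLRs
llr_u0 = kernel_llr(lam, arikan, (), 1)         # LLR of u with nothing decided yet
llr_u1 = kernel_llr(lam, arikan, (1,), 1)       # LLR of v given the decision u = 1
```

For the $(u + v, v)$ kernel the two values coincide with the closed-form expressions of Algorithm 6, i.e. $2 \tanh^{-1}(\tanh(\lambda_0/2)\tanh(\lambda_1/2))$ and $(-1)^{\hat{u}} \cdot \lambda_0 + \lambda_1$.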

We now turn to describe the SC decoding algorithm for length \( N \geq \ell \) homogenous polar code over \( F \) based on the same kernel \( g(\cdot) \). As we already mentioned, due to the code structure, the decoding algorithm is composed of pairs of steps, such that the \( i^{th} \) pair deals with the \( i^{th} \) outer-code, where \( i \in [\ell] \).

We denote the information word that was estimated by the decoder of the \( m^{th} \) outer-code by \( \hat{u}^{(m)} \) and its corresponding codeword by \( \hat{x}^{(m)} \); both of them are of length \( N/\ell \) symbols. Algorithm 8 describes the \( i^{th} \) pair of steps of the SC algorithm, \( i \in [\ell]_- \), and Algorithm 9 specifies its output generation.

Algorithm 8 SC Decoder Steps Dedicated for Outer-Code $C_i, \ i \in [\ell]_-$

//STEP $2 \cdot i$:
▷ Prepare $|F| - 1$ LLR input vectors $\left\{ \hat{\lambda}^{(t)} \right\}_{t \in F \setminus \{0\}}$, each one of length $N/\ell$, using (24), i.e.
\[
\hat{\lambda}_j^{(t)} = \ln \left( \frac{\sum_{w_{i+1}^{\ell-1} \in F^{\ell-i-1}} R_g \left( \hat{x}_j^{(0)}, \hat{x}_j^{(1)}, \ldots, \hat{x}_j^{(i-1)}, 0, w_{i+1}^{\ell-1} \right)}{\sum_{w_{i+1}^{\ell-1} \in F^{\ell-i-1}} R_g \left( \hat{x}_j^{(0)}, \hat{x}_j^{(1)}, \ldots, \hat{x}_j^{(i-1)}, t, w_{i+1}^{\ell-1} \right)} \right), \ \forall t \in F \setminus \{0\} \text{ and } \forall j \in [N/\ell]_-, \tag{24}
\]

where $\hat{x}^{(m)}$ is the estimated codeword of outer-code $C_m$ that was computed at the previous steps, $m \in [i]_-$. Note that for the LLR calculation of $\hat{\lambda}_j^{(t)}$ in (24) we use input LLRs corresponding to channel indices $j \cdot \ell, j \cdot \ell + 1, \ldots, (j + 1) \cdot \ell - 1$.

//STEP $2 \cdot i + 1$:
▷ Decode the $i^{th}$ outer-code using the computed LLR vectors $\left\{ \hat{\lambda}^{(t)} \right\}_{t \in F \setminus \{0\}}$, i.e.
\[
\left[ \hat{u}^{(i)}, \hat{x}^{(i)} \right] = \text{SCDecoder} \left( \left\{ \hat{\lambda}^{(t)} \right\}_{t \in F \setminus \{0\}}, z_{i \cdot N/\ell}^{(i+1) \cdot N/\ell - 1} \right). \tag{25}
\]

Algorithm 9 SC Decoder Output Generation

Output: (occurs after applying Algorithm 8 for all $i \in [\ell]_-$)

- $\hat{u} = \left[ \hat{u}^{(0)}, \hat{u}^{(1)}, \ldots, \hat{u}^{(\ell-1)} \right]$;
- $\hat{x}_{j \cdot \ell}^{(j+1) \cdot \ell - 1} = g \left( \hat{x}_j^{(0)}, \hat{x}_j^{(1)}, \ldots, \hat{x}_j^{(\ell-1)} \right), \ \forall j \in [N/\ell]_-$.

**Remark 2 (LLR Calculations Simplification for Linear Kernels)** Let us assume that \( g(\cdot) \) is an \( \ell \) dimensions linear kernel, having a generating matrix \( G \in F^{\ell \times \ell} \), such that \( x = g(u) = u \cdot G \). It can be easily seen that if \( \hat{x} = \hat{u}_0^{i-1} \cdot G_{0:(i-1)\rightarrow} \), where \( G_{a:b\rightarrow} \) denotes rows \( a \) through \( b \) of \( G \), then (22) is equivalent to

\[
\hat{\lambda}^{(t)} = \ln \left( \frac{\sum_{\mathbf{x} \in \Gamma_i + \hat{x}} \exp \left( -\sum_{r=0}^{\ell-1} \lambda_r^{(x_r)} \right)}{\sum_{\mathbf{x} \in \Gamma_i + \hat{x} + t \cdot G_{i\rightarrow}} \exp \left( -\sum_{r=0}^{\ell-1} \lambda_r^{(x_r)} \right)} \right), \quad t \in F \setminus \{0\},
\tag{26}
\]

where \( \Gamma_i = \{ \mathbf{v} \mid \mathbf{v} = \mathbf{w} \cdot G_{(i+1):(\ell-1)\rightarrow}, \ \mathbf{w} \in F^{\ell-1-i} \} \). Note that \( \Gamma_i \) is the linear code induced by the last \( \ell - 1 - i \) rows of the generating matrix \( G \). Furthermore \( \Gamma_i + \hat{x} \) is the coset of the linear code \( \Gamma_i \) that is induced by the coset vector \( \hat{x} \).

The calculation method in (26) is attractive because it implements the enumeration of the coset members as a summation of \( \hat{x} \) (the estimated coset vector, computed throughout the algorithm) with predetermined sets \( \Gamma_i + t \cdot G_{i\rightarrow} \) (the cosets of \( \Gamma_i \) in \( \Gamma_{i-1} \)). Therefore, efficient ways to calculate (26) for the case of \( \hat{x} = 0 \) (e.g. using trellis decoding by employing the dual code of \( \Gamma_i \)) can be easily utilized for calculating (26) for non-zero \( \hat{x} \). This can be done by appropriately modifying the input LLR vector, reflecting the notion that all the possible enumerated codewords are members of the cosets considered in the case of \( \hat{x} = 0 \), shifted by the constant vector \( \hat{x} \). As a consequence, using as inputs the LLRs of a modified channel, generated by adding the known vector $\hat{x}$ to the original channel output, allows employing the computations of the zero case in the general case.

Algorithm 8 can be adapted to support the computation technique suggested here. First, we initialize the vector $\hat{x}$ (later given as output) to be the all-zeros vector. Secondly, we replace (24) by the following calculation

$$\hat{\lambda}_j^{(t)} = \ln \left( \frac{\sum_{\mathbf{x} \in \Gamma_i + \hat{x}_{j \cdot \ell}^{(j+1) \cdot \ell - 1}} \exp \left( - \sum_{r=0}^{\ell-1} \lambda_{j \cdot \ell + r}^{(x_r)} \right)}{\sum_{\mathbf{x} \in \Gamma_i + \hat{x}_{j \cdot \ell}^{(j+1) \cdot \ell - 1} + t \cdot G_{i\rightarrow}} \exp \left( - \sum_{r=0}^{\ell-1} \lambda_{j \cdot \ell + r}^{(x_r)} \right)} \right), \quad t \in F \setminus \{0\}, \ \forall j \in [N/\ell]_-. \quad (27)$$

Thirdly, after estimating $\hat{x}^{(i)}$, the outer-code codeword of $C_i$ in (25), we need to update the coset vector $\hat{x}$ by calculating

$$\hat{x}_{j \cdot \ell}^{(j+1) \cdot \ell - 1} = \hat{x}_{j \cdot \ell}^{(j+1) \cdot \ell - 1} + \hat{x}_j^{(i)} \cdot G_{i\rightarrow}, \quad \forall j \in [N/\ell]_-. \quad (28)$$

As a consequence, in Algorithm 9 the decoder can just output $\hat{x}$ that was calculated throughout the odd steps of the algorithm. This simplification is used in our suggested schematic implementation in Subsection 5.2.4.

In case we have a mixed-kernels construction, the generalization is quite easy. In order to illustrate this we consider an example of an $\ell$ dimensions kernel in which we have glued the symbols $u_1$ and $u_2$ to a new symbol $u_{1, 2} \in F^2$ (see Example 3 for an instance of such a structure). In this case, we treat these two symbols as one entity, and consider the outer-code associated with them, denoted as $C_{1, 2}$, as an $N/\ell$ length code over the alphabet $F^2$. The only change we have in the decoding algorithm is for the pair of decoding steps of Algorithm 5 corresponding to this "glued" symbols outer-code. For the first step in the pair, we need to compute $|F|^2 - 1$ LLR vectors $\{\hat{\lambda}^{(t_1, t_2)}\}_{(t_1, t_2) \in F^2 \setminus \{(0, 0)\}}$, each one of length $N/\ell$. These vectors serve as an input to the decoder of $C_{1, 2}$. In this case, each LLR component in the vector is a function of both the $u_1$ and $u_2$ inputs to the kernel. Equation (24) is therefore updated as follows:

$$\hat{\lambda}_j^{(t_1, t_2)} = \ln \left( \frac{\sum_{w_3^{\ell-1} \in F^{\ell-3}} R_g \left( \hat{x}_j^{(0)}, 0, 0, w_3^{\ell-1} \right)}{\sum_{w_3^{\ell-1} \in F^{\ell-3}} R_g \left( \hat{x}_j^{(0)}, t_1, t_2, w_3^{\ell-1} \right)} \right), \quad \forall j \in [N/\ell]_-. \quad (29)$$

The second step of the pair in Algorithm 5 remains unchanged.
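Returning to the homogenous case, the pairs of steps of Algorithm 8 together with the output generation of Algorithm 9 can be sketched in Python as follows. This is a minimal illustration under assumptions: the alphabet is binary, the LLRs of (24) are computed by brute-force enumeration, the recursion ends at single symbols with frozen bits equal to 0, and the 3-dimensional kernel `g3` is a hypothetical example chosen only because it is invertible (it is not claimed to be a good polarizing kernel). All function names are hypothetical.

```python
import itertools
import math


def kernel_llr(block, g, prefix):
    """Binary instance of (24): LLR of kernel input number len(prefix)."""
    ell = len(block)

    def R(u):
        x = g(u)
        return math.exp(-sum(block[r] for r in range(ell) if x[r] == 1))

    def total(symbol):
        tails = itertools.product((0, 1), repeat=ell - len(prefix) - 1)
        return sum(R(prefix + (symbol,) + tail) for tail in tails)

    return math.log(total(0) / total(1))


def encode(u, g, ell):
    """Non-systematic GCC encoder, used here only to create a test codeword."""
    N = len(u)
    if N == 1:
        return list(u)
    sub = N // ell
    outer = [encode(u[i * sub:(i + 1) * sub], g, ell) for i in range(ell)]
    x = []
    for j in range(sub):
        x += list(g(tuple(outer[m][j] for m in range(ell))))
    return x


def sc_decode(lam, z, g, ell):
    """Recursive SC decoder sketch for a binary kernel g; returns (u_hat, x_hat)."""
    N = len(lam)
    if N == 1:
        bit = 0 if (z[0] == 1 or lam[0] >= 0) else 1   # frozen bits assumed to be 0
        return [bit], [bit]
    sub = N // ell
    u_hat, outer = [], []
    for i in range(ell):
        # STEP 2i: prepare the LLRs of outer-code C_i from channel indices j*ell..(j+1)*ell-1
        lam_i = [kernel_llr(lam[j * ell:(j + 1) * ell], g,
                            tuple(outer[m][j] for m in range(i)))
                 for j in range(sub)]
        # STEP 2i+1: decode outer-code C_i, as in (25)
        u_i, x_i = sc_decode(lam_i, z[i * sub:(i + 1) * sub], g, ell)
        u_hat += u_i
        outer.append(x_i)
    x_hat = []
    for j in range(sub):                               # output generation of Algorithm 9
        x_hat += list(g(tuple(outer[m][j] for m in range(ell))))
    return u_hat, x_hat


def g3(u):
    """Hypothetical linear kernel u * G with G = [[1,0,0],[1,1,0],[1,0,1]] over GF(2)."""
    return ((u[0] + u[1] + u[2]) % 2, u[1], u[2])


u = [0, 0, 0, 0, 1, 1, 1, 0, 1]
z = [1, 1, 1, 1, 0, 0, 0, 0, 0]                        # illustrative frozen pattern
x = encode(u, g3, 3)
lam = [5.0 if bit == 0 else -5.0 for bit in x]         # noiseless channel LLRs
u_hat, x_hat = sc_decode(lam, z, g3, 3)
```

The enumeration inside `kernel_llr` grows exponentially with the kernel dimension, which is exactly the cost that the coset formulation of Remark 2 and the approximations mentioned at the beginning of this section aim to reduce.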

### 4.2 A Recursive Description of the SCL Algorithm

In this subsection we provide a recursive description of the SCL decoder, originally introduced by Tal and Vardy [11]. Each stage of the SCL algorithm involves comparisons of likelihoods of different SC decoding possibilities (resulting from keeping more than one decision option at the previous decoding stages). Therefore, we assume that the inputs to the algorithm as well as its internal computation values are interpreted as likelihoods, instead of LLRs. Note, however, that LLRs can be used as well, see Remark 1.

The SCL algorithm, described in this subsection, returns as output a list of decoding possibilities. The most likely element of this list should be given as output.

### 4.2.1 Sequential Decoders as Path Traversal Algorithms in Decoding Trees

Before delving into the details of SCL let us discuss the general idea that this algorithm entails. Sequential decoding algorithms examine their decision space (i.e. the set of all possible results) and choose a result from it, by gradually refining the space (i.e. eliminating some of the possible outcomes) until a predetermined number of outcomes remains (in SC the number is 1, in SCL the number is $L$, from which the best outcome is chosen). In SC and SCL the decision space is described by the input vector to the encoder, $u$. The decision space is refined by determining the components of $u$ in a consecutive order.

\[^3\] The notion of likelihoods normalization that was considered by Tal and Vardy [11, Algorithm 14] to avoid floating-point or fixed-point underflows is also applicable here and should be employed for numerical stability.

It is quite common to describe the decision space of an algorithm by an edge-labeled directed tree dubbed a decoding tree. Note, however, that, strictly speaking, the graphical structure of the decoding tree is generally a forest, because we may have multiple nodes at the top of the tree (representing different input models) which are not connected to each other. These nodes are dubbed the roots of the decoding tree.

Figure 7 illustrates such a decoding tree used in sequential decoding of Arikan’s $(u + v, v)$ polar code.

The decoding tree is a layered graph, such that the edges of each layer correspond to a single entry of the vector $u$ (the layer boundaries are indicated by the dotted lines in Figure 7). The nodes in the graph indicate sequential decision junctions and the edges emanating from each node represent possible assignments to the variable of the layer. The single path between the roots of the decoding tree and a node appearing on the top of layer $u_i$ in the graph indicates previous decisions (on variables $u_0^{i-1}$) that preceded the decision on $u_i$. Consequently, the paths between the root of the tree and the leaves of the tree correspond to all possible assignments to the vector $u$. For example, in Figure 7 the paths of the illustrated tree correspond to all the binary assignments to $u_i^{i+3}$ given that $u_0^{i-1}$ is a fixed prefix (indicated in the figure by the string 01001..., and further denoted by $\hat{u}_0^{i-1}$) and $u_i^{i+3} \in \{0, 1\}^4$.

In the SC algorithm the decoder always considers a single path (dubbed a decoding-path) among the possible tree paths. The decoding-path is gradually paved by sequentially joining to it edges emanating from nodes reached by the previous stages. On stage $i$ of the algorithm the edge selection corresponds to the most preferable assignment to the variable $u_i$ (assuming that the path leading to $u_i$’s layer is fixed).

When applied to polar codes, SC has an advantage in terms of its algorithmic complexity. Utilizing the recursive structure of the code, it is possible to efficiently compute the likelihoods needed for the decision on $u_i$ by reusing previous computation results obtained when deciding on $u_{i-1}$. In other words, in SC the channel observation model is easily updated given some results of the calculations performed for determining the former observation model. The space needed to store these temporary calculations is linear in the code length (assuming the kernel size $\ell$ and the alphabet size $|F|$ are fixed). On the other hand, when the SC algorithm decides on $u_i$ it does not take into account the existence of possible frozen values of its descendant nodes. In other words, it assumes that all the assignments to $u_{i+1}^{N-1}$ are possible when calculating the likelihoods, even though the code structure forces certain variables to be fixed. This lack of "future awareness" and the inability of the algorithm to change its past decisions (i.e. the algorithm always advances in one direction in the tree, from "top" to "bottom") are fundamental reasons for its sub-optimality.

The SCL algorithm with list of size $L$ is a generalization of SC, in which the decoder considers simultaneously at most $L$ possible decoding-paths. It can be seen that the complexity of SCL, both in time and in space, will be approximately bounded from above by the complexity of SC times $L$. The reason for this is that the operations associated with each tree junction in SCL are roughly the same ones that would have been associated with the junction if it were on a single SC path. We need, however, additional operations for choosing the $L$ edges with the maximal likelihoods for continuing the paths. This can be done in linear time (in $L$) per each decoding tree layer (information symbol, $u_i$). Secondly, tracking data structures need to be defined and utilized, in order to keep tabs on the existing decoding-paths while allowing an emulation of the SC algorithm for each separate decoding path. As we next see, we can employ such structures and algorithms that will not exceed $L$ times the asymptotic complexity of SC.
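The path selection step can be illustrated by the short Python sketch below. It is only a sketch under assumptions: the function name `extend_and_prune` and the input layout are hypothetical, and the selection uses `heapq.nlargest`, which runs in $O(n \log L)$ time for $n$ candidates rather than the linear time mentioned above (a linear-time selection algorithm would be needed to meet that bound).

```python
import heapq


def extend_and_prune(path_likelihoods, L):
    """Keep the L most likely extensions of the current decoding-paths.

    path_likelihoods[p][b] is the likelihood of extending path p with symbol b.
    Returns a list of (likelihood, parent path index, symbol), most likely first.
    """
    candidates = [(likelihood, p, b)
                  for p, row in enumerate(path_likelihoods)
                  for b, likelihood in enumerate(row)]
    return heapq.nlargest(L, candidates, key=lambda c: c[0])


# two active paths, each extended by a binary symbol, list size L = 2
survivors = extend_and_prune([[0.1, 0.4], [0.3, 0.2]], 2)
```

Here the surviving extensions are symbol 1 appended to path 0 and symbol 0 appended to path 1.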

In Figure 9 we describe the SC and SCL algorithms as a sequence of decoding tree refinements. The models for these decisions are sequentially updated using past decisions (selected paths) and the current observation model. The connection between the input observation model and the input model for the next outer-code is defined by the inner-code layer. These updated input models are recursively provided to the smaller outer-codes, until we reach codes of single symbols (corresponding to the elements $u_j$ of $u$) in which decisions are made. It is indeed a property of the recursive description of SC and SCL that each recursion step utilizes a decoding tree whose layers are the outer-codes. A fundamental property of the algorithms is that on each recursion step, updating a model based on an edge selection is linear in the outer-code length (assuming that the kernel is fixed). As a consequence, the number of operations of SC is $O(N \cdot \log N)$.

### 4.2.2 Data Structures for Tracking Decoding Paths in SCL

Tracking the employed observation-model is easy in SC because at any given point in time we assume only a single model (induced by previous SC decisions). On the other hand, in SCL, multiple models are considered simultaneously and it is therefore required to efficiently keep track of them. Specifically, for each constituent code of the GCC we must store the tree structure connecting its outer-codes.

Figure 9: SCL ($L = 4$) algorithm example for the $(u + v, v)$ code with $N = 8$ bits (see Figure 7). The right side illustrates a decoding tree on the outer-codes $(C_0, C_1)$ of the structure. The left decoding tree expands each edge of the right tree into decoding-paths on the outer-codes of $C_0$ and $C_1$. The labels of the edges are the values of the outer-codes.

We now propose data structures for meeting this requirement.

- **$S^{(e)}$** - an $\ell \times L$ matrix describing the edges of the decoding tree. Specifically, row $S^{(e)}_{0\rightarrow}$ contains indications for the edges in the $C_0$ layer and row $S^{(e)}_{1\rightarrow}$ corresponds to the $C_1$ layer. The only interesting nodes in the tree are the ones having a decoding path leading to them (we call them active nodes). We use the arbitrary convention that active nodes are assigned numbers in $[L]_-$, starting from the level's leftmost node to the rightmost node as they appear in the figure. To represent this in our data structure we let $S^{(e)}_{i,j}$ contain the index of the single node at the top of layer $i$ that is connected to node $j$ at the bottom of the layer. In case there are fewer than $L$ nodes at the bottom of a layer, the matrix entries corresponding to the missing nodes are assigned the null symbol, $\phi$.

- **$S^{(p)}$** - an $L \times \ell$ matrix, such that row $S^{(p)}_{i\rightarrow}$ defines the single path between the roots of the tree and the $i^{th}$ node at the bottom of the final layer. This path is specified in terms of the node indices, such that $S^{(p)}_{i,j}$ is the node located on the top of layer $j$ in the path. Note that $S^{(p)}$ is easily derived from $S^{(e)}$.

- **$s$** - an $L$ length vector describing the origin model for each decoding path, i.e. $s_i = S^{(p)}_{i,0}$ for $i \in [L]_-$.

- **$\hat{X}^{(i)}$** where $i \in [\ell]_-$ - $\ell$ matrices (of dimensions $L \times N/\ell$) used for keeping the labels of the selected edges in $S^{(e)}$. Here row $\hat{X}^{(i)}_{r\rightarrow}$ contains the label of the edge pointing to node $r \in [L]_-$ at the bottom of layer $i \in [\ell]_-$ in the decoding tree. Note that this edge is represented by the $S^{(e)}_{i,r}$ entry.

- **$\hat{U}^{(i)}$** where $i \in [\ell]_-$ - $\ell$ matrices (of dimensions $L \times N/\ell$), such that row $\hat{U}^{(i)}_{r\rightarrow}$ is the information word (including the assignment of the frozen symbols) of the outer-code corresponding to the codeword $\hat{X}^{(i)}_{r\rightarrow}$.

The decoding-paths data structures are generated throughout the decoding process. SCL's sequential traversal of the decoding tree, from its first level to its leaves, results in updating these matrices. After deciding on layer $i \in [\ell]_-$ we write row $i$ in $S^{(e)}$, and prepare matrices $\hat{X}^{(i)}$ and $\hat{U}^{(i)}$. Following the update of $S^{(e)}$, we prepare a new version of the paths matrix $S^{(p)}$ and its corresponding source vector $s$. Note that on stage $i$, we interpret $S^{(p)}_{r,0:i-1}$ as the single path leading from the roots to node $r$ at the bottom of layer $i$.
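The derivation of $S^{(p)}$ and $s$ from $S^{(e)}$ amounts to backtracking from each node at the bottom of the final layer towards the roots. The Python sketch below illustrates this under assumptions: matrices are stored as nested lists, the null symbol $\phi$ is represented by `None`, and the function name `paths_from_edges` as well as the example tree are hypothetical.

```python
def paths_from_edges(S_e):
    """Derive the paths matrix S_p (L x ell) and the source vector s from S_e (ell x L).

    S_e[i][j] is the index of the node at the top of layer i that is connected to
    node j at the bottom of layer i, or None when node j does not exist.
    """
    num_layers, L = len(S_e), len(S_e[0])
    S_p = [[None] * num_layers for _ in range(L)]
    for r in range(L):
        node = r                                  # node r at the bottom of the final layer
        for layer in reversed(range(num_layers)):
            node = S_e[layer][node]               # move to the top of this layer
            if node is None:                      # inactive node: leave the row empty
                break
            S_p[r][layer] = node
    s = [row[0] for row in S_p]                   # origin model of each decoding path
    return S_p, s


# hypothetical tree with ell = 2 layers, L = 4 and two root models
S_e = [[0, 0, 1, 1],
       [0, 2, 2, 3]]
S_p, s = paths_from_edges(S_e)
```

In this example the path to node 1 at the bottom of the final layer passes through node 2 at the top of layer 1 and originates from root model 1.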

### 4.2.3 SCL Recursive Definition

Having defined the decoding paths tracking data structures, we are now ready to describe the SCL algorithm's inputs and outputs. Consider SCL for the $(u + v, v)$ polar code of length $N$ bits with list size $L$.
The inputs of the algorithm are listed below.

- Two likelihood matrices $\Pi^{(0)}$ and $\Pi^{(1)}$ of $L \times N$ dimensions. Each row of the matrices corresponds to a different input observation model option, considered by the decoder. The plurality of models exists, due to SCL’s feature of constantly keeping a list of $L$ decoding-paths representing past decisions on the information word symbols. Each decoding-path induces a different statistical model, in which it is assumed that the information sub-vector, associated with it, is the one that was transmitted. We have
\[
\Pi^{(b)}_{i,j} = \Pr \left( Y^{(i)}_j = y^{(i)}_j | V_j = b \right),
\]
where $Y^{(i)}_j$ is the measurement of the $j$th channel $V_j \rightarrow Y_j$ of the $i$th option in the list and $b \in \{0, 1\}$.

- A scalar $\rho_{in}$ indicating how many rows in $\Pi^{(0)}$ and $\Pi^{(1)}$ are occupied. The algorithm supports tracking of $\rho_{in} \in [L]$ input models simultaneously.

- A vector indicator $z \in \{0, 1\}^N$, in which $z_i = 1$ if and only if the $i$th component of $u$ is frozen.

The algorithm outputs the following structures.

- A matrix $\hat{U}$ of $L \times N$ dimensions, which represents $L$ arrays of information values (each array of length $N$) - this is the list of the possible information words that the decoder estimated.

- A matrix $\hat{X}$ of $L \times N$ dimensions, which represents $L$ arrays of codewords (each array of length $N$) - this is the list of codewords that correspond to the information words in $\hat{U}$.

- A vector $s_0^{L-1}$ that indicates, for each row in $\hat{U}$ and $\hat{X}$, the row in the input $\Pi^{(0)}$ and $\Pi^{(1)}$ from which it originated (i.e. it refers to the statistical model that was assumed when estimating this row).

- A scalar $\rho_{out}$ indicating how many rows in $\hat{U}$ or $\hat{X}$ are occupied.

The SCL function signature is defined as
\[
[\hat{U}, \hat{X}, s, \rho_{out}] = \text{SCLDecoder} \left( \{\Pi^{(b)}\}_{b \in \{0, 1\}}, \rho_{in}, z \right).
\]

For the length $N = 2$ bits code (i.e. the code induced by a single application of the kernel) the procedure is described in Algorithm 10. In order to specify the SCL decoder for a length $N = 2^n$ polar code, let us assume that we have already developed an SCL decoder for a length $N/2$ polar code. Using this assumption, a recursive decoder for a length $N$ polar code is described in Algorithm 11. Let $T(n)$ be the decoding time complexity for a length $N = 2^n$ bits polar code. Then $T(n) = 2 \cdot T(n - 1) + O(L \cdot N)$, and $T(1) = O(L)$, which results in $T(n) = O(L \cdot N \cdot \log N)$. Similarly, the space complexity of the algorithm can be shown to be $O(L \cdot N)$.
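The time recurrence can be checked numerically. The sketch below assumes a unit-cost model in which every hidden constant is set to one, i.e. $T(1) = L$ and $T(n) = 2 \cdot T(n-1) + L \cdot 2^n$; the function name `scl_cost` is hypothetical. Under this assumption the recurrence has the closed form $L \cdot 2^{n-1} \cdot (2n - 1)$, which is $O(L \cdot N \cdot \log N)$.

```python
def scl_cost(n, L):
    """Unit-cost model of the SCL recurrence (all constants assumed to be 1)."""
    if n == 1:
        return L
    # Two recursive calls on length N/2 plus L*N work for the likelihood updates.
    return 2 * scl_cost(n - 1, L) + L * 2 ** n


def scl_cost_closed_form(n, L):
    """Closed form of the recurrence above: L * 2^(n-1) * (2n - 1)."""
    return L * 2 ** (n - 1) * (2 * n - 1)
```

Dividing the recurrence by $2^n$ gives $T(n)/2^n = T(n-1)/2^{n-1} + L$, which telescopes to $T(n)/2^n = L \cdot (n - 1/2)$.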

The generalization of the decoding algorithm to a homogeneous kernel of $\ell$ dimensions with alphabet $F$ is quite straightforward. Here we emphasize the principal changes from the $(u + v, v)$ case. Firstly, the only change in the input is that we should have $|F|$ channel matrices, $\Pi^{(b)}$, one for each alphabet symbol $b \in F$. With this change in alphabet the definition of each matrix in (30) remains. Consequently, the function signature is defined as follows.
\[
[\hat{U}, \hat{X}, s, \rho_{out}] = \text{SCLDecoder} \left( \{\Pi^{(b)}\}_{b \in F}, \rho_{in}, z \right).
\]

In the decoding algorithm, we have $\ell$ pairs of steps, such that each one is dedicated to a different outer-code. Before reaching step $2 \cdot i - 1$, we already decoded outer-codes $\{C_m\}_{m=0}^{i-1}$. Using the decoding tree terminology, we can say that we have traversed $i$ layers of the tree (starting from the roots) generating at most $L$ decoding-paths. As a result we have the paths tracking data structures $S^{(e)}, S^{(p)}, s, \{\hat{U}^{(m)}\}_{m=0}^{i-1}$
Algorithm 10 SCL Decoding for the \((u + v, v)\) Kernel

Input: $\{\Pi^{(b)}\}_{b \in \{0, 1\}}$; $\rho_{in}$; $\mathbf{z}$.

//Initialization:
▷ Initialize the decoding-paths data structures: $S^{(e)}$, $S^{(p)}$, $s$, $\hat{U}^{(0)}$, $\hat{X}^{(0)}$, $\hat{U}^{(1)}$ and $\hat{X}^{(1)}$.

//STEP 0:
▷ Generate two $\rho_{in}$ length vectors, $p^{(0)}$ and $p^{(1)}$. For each of the $\rho_{in}$ occupied rows of $\Pi^{(0)}$ and $\Pi^{(1)}$ compute

\[
p^{(0)}_r = \frac{1}{2} \left( \Pi^{(0)}_{r,0} \cdot \Pi^{(0)}_{r,1} + \Pi^{(1)}_{r,0} \cdot \Pi^{(1)}_{r,1} \right) \quad \text{and} \quad p^{(1)}_r = \frac{1}{2} \left( \Pi^{(0)}_{r,0} \cdot \Pi^{(1)}_{r,1} + \Pi^{(1)}_{r,0} \cdot \Pi^{(0)}_{r,1} \right), \quad \forall r \in [\rho_{in}]_-.
\]

//STEP 1:
▷ Concatenate the two vectors into one $2 \cdot \rho_{in}$ length vector, $\mathbf{p} = [p^{(0)}, p^{(1)}]$.
▷ Let $\hat{\mathbf{p}}$ be a vector that contains the $\rho = \min\{2 \cdot \rho_{in}, L\}$ largest values of $\mathbf{p}$.
▷ For each $r \in [\rho]_-$ have $S^{(e)}_{0,r} = \sigma$ and $\hat{U}^{(0)}_r = \beta$ if and only if the $r^{th}$ component of $\hat{\mathbf{p}}$ originated from $p^{(\beta)}_{\sigma}$. In other words, its source model is $\sigma$ and the decoding tree edge connecting the source model (represented by a node at the top level of the graph) and node $r$ at the bottom of the first layer has label $\beta$.
Remark: If $u$ is frozen (without loss of generality assume that it is set to the 0 value), then steps 0 and 1 can be skipped and $\rho = \rho_{in}$, $S^{(e)}_{0,0:\rho_{in}-1} = [0, 1, \ldots, \rho_{in} - 1]$, $\hat{U}^{(0)} = \mathbf{0}$.
▷ Update $S^{(p)}$ and $s$ accordingly.

//STEP 2:
▷ Generate two $\rho$ length vectors, $p^{(0)}$ and $p^{(1)}$. For each of the $\rho$ occupied rows of $S^{(p)}$ compute, $\forall r \in [\rho]_-$,

\[
p^{(0)}_r = \frac{1}{2} \begin{cases} \Pi^{(0)}_{s_r,0} \cdot \Pi^{(0)}_{s_r,1}, & \hat{U}^{(0)}_r = 0; \\ \Pi^{(1)}_{s_r,0} \cdot \Pi^{(0)}_{s_r,1}, & \hat{U}^{(0)}_r = 1. \end{cases} \quad (32)
\]

\[
p^{(1)}_r = \frac{1}{2} \begin{cases} \Pi^{(1)}_{s_r,0} \cdot \Pi^{(1)}_{s_r,1}, & \hat{U}^{(0)}_r = 0; \\ \Pi^{(0)}_{s_r,0} \cdot \Pi^{(1)}_{s_r,1}, & \hat{U}^{(0)}_r = 1. \end{cases} \quad (33)
\]

//STEP 3:
▷ Concatenate the two vectors into one $2 \cdot \rho$ length vector, $\mathbf{p} = [p^{(0)}, p^{(1)}]$.
▷ Let $\hat{\mathbf{p}}$ be a vector that contains the $\rho_{out} = \min\{2 \cdot \rho, L\}$ largest values of $\mathbf{p}$.
▷ For each $r \in [\rho_{out}]_-$ have $S^{(e)}_{1,r} = \sigma$ and $\hat{U}^{(1)}_r = \beta$ if and only if the $r^{th}$ component of $\hat{\mathbf{p}}$ originated from $p^{(\beta)}_{\sigma}$.
Remark: If the second bit is frozen (without loss of generality assume that it is set to the 0 value), then steps 2 and 3 can be skipped and $S^{(e)}_{1,0:\rho_{out}-1} = [0, 1, \ldots, \rho_{out} - 1]$, $\hat{U}^{(1)} = \mathbf{0}$, $\rho_{out} = \rho$.
▷ Update $S^{(p)}$ and $s$ accordingly.

Output:
• $\hat{U}_{r \to} = \left[ \hat{U}^{(0)}_r, \hat{U}^{(1)}_r \right]$, $\forall r \in [\rho_{out}]_-$;
• $\hat{X} = \left[ \hat{U}_{\downarrow 0} + \hat{U}_{\downarrow 1}, \hat{U}_{\downarrow 1} \right]$;
• $s$;
• $\rho_{out}$.
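The length $N = 2$ procedure can be sketched in a few lines. This is a minimal illustration rather than a reference implementation: it assumes $x_0 = u + v$ and $x_1 = v$, stores likelihoods as nested lists `pi0[r][j]` and `pi1[r][j]`, and folds the tracking structures into plain tuples `(source row, u, v)`. The name `scl_decode_n2` and this flattened bookkeeping are hypothetical.

```python
def scl_decode_n2(pi0, pi1, rho_in, z, L):
    """SCL decoding of the length-2 (u+v, v) code, returning U, X, s, rho_out."""
    # STEPS 0-1: extend each input model by u = 0 and u = 1, keep the L best.
    if z[0]:
        cand = [(r, 0) for r in range(rho_in)]  # frozen u is assumed to be 0
    else:
        scored = []
        for r in range(rho_in):
            p0 = 0.5 * (pi0[r][0] * pi0[r][1] + pi1[r][0] * pi1[r][1])
            p1 = 0.5 * (pi1[r][0] * pi0[r][1] + pi0[r][0] * pi1[r][1])
            scored += [(p0, r, 0), (p1, r, 1)]
        scored.sort(key=lambda t: -t[0])
        cand = [(r, u) for _, r, u in scored[:L]]
    # STEPS 2-3: extend each surviving path by v = 0 and v = 1, keep the L best.
    if z[1]:
        paths = [(src, u, 0) for src, u in cand]  # frozen v is assumed to be 0
    else:
        scored = []
        for src, u in cand:
            for v in (0, 1):
                like_x0 = (pi1 if u ^ v else pi0)[src][0]  # x0 = u + v
                like_x1 = (pi1 if v else pi0)[src][1]      # x1 = v
                scored.append((0.5 * like_x0 * like_x1, src, u, v))
        scored.sort(key=lambda t: -t[0])
        paths = [(src, u, v) for _, src, u, v in scored[:L]]
    U = [[u, v] for _, u, v in paths]
    X = [[u ^ v, v] for _, u, v in paths]
    s = [src for src, _, _ in paths]
    return U, X, s, len(paths)


U, X, s, rho_out = scl_decode_n2([[0.9, 0.8]], [[0.1, 0.2]], 1, [0, 0], L=2)
```

For this single-model input both extensions of $u$ survive the first selection, and the second selection keeps the information words $[0, 0]$ and $[1, 1]$.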
Algorithm 11 SCL Decoder for Length $N = 2^n$ Bits $(u+v, v)$ Polar Code

Input: $\{\Pi^{(b)}\}_{b \in \{0,1\}}$; $\rho_{in}$; $\mathbf{z}$.

//Initialization: $\triangleright$ Initialize the decoding-paths data structures: $S^{(e)}$, $S^{(p)}$, $s$, $\hat{U}^{(0)}$, $\hat{X}^{(0)}$, $\hat{U}^{(1)}$ and $\hat{X}^{(1)}$.

//STEP 0: $\triangleright$ Prepare the probability transition matrices for the first outer-code decoder. Specifically, generate two matrices $P^{(b)}$ of dimensions $L \times N/2$, $b \in \{0,1\}$, such that

$$P^{(0)}_{r,j} = \frac{1}{2} \left( \Pi_{r,2j}^{(0)} \cdot \Pi_{r,2j+1}^{(0)} + \Pi_{r,2j}^{(1)} \cdot \Pi_{r,2j+1}^{(1)} \right)$$

and

$$P^{(1)}_{r,j} = \frac{1}{2} \left( \Pi_{r,2j}^{(1)} \cdot \Pi_{r,2j+1}^{(0)} + \Pi_{r,2j}^{(0)} \cdot \Pi_{r,2j+1}^{(1)} \right), \quad \forall r \in [\rho_{in}]_-, \forall j \in [N/2]_-.$$

//STEP 1: $\triangleright$ Decode the first outer-code using the updated channel model matrix, i.e.

$$\left[\hat{U}^{(0)}, \hat{X}^{(0)}, S^{(e)}_{0 \rightarrow}, \rho\right] = \text{SCLDecoder} \left(\left\{P^{(b)}\right\}_{b \in \{0,1\}}, \rho_{in}, \mathbf{z}_{0}^{N/2-1}\right).$$

$\triangleright$ Update $S^{(p)}$ and $s$ following (36).

//STEP 2: $\triangleright$ Prepare the input matrices for the decoder of the second outer-code of length $N/2$. Specifically, generate two matrices $P^{(b)}$ of dimensions $L \times N/2$, $b \in \{0,1\}$, such that

$$P^{(0)}_{r,j} = \frac{1}{2} \begin{cases} \Pi_{s_r,2j}^{(0)} \cdot \Pi_{s_r,2j+1}^{(0)}, & \hat{X}^{(0)}_{r,j} = 0; \\ \Pi_{s_r,2j}^{(1)} \cdot \Pi_{s_r,2j+1}^{(0)}, & \hat{X}^{(0)}_{r,j} = 1; \end{cases}$$

and

$$P^{(1)}_{r,j} = \frac{1}{2} \begin{cases} \Pi_{s_r,2j}^{(1)} \cdot \Pi_{s_r,2j+1}^{(1)}, & \hat{X}^{(0)}_{r,j} = 0; \\ \Pi_{s_r,2j}^{(0)} \cdot \Pi_{s_r,2j+1}^{(1)}, & \hat{X}^{(0)}_{r,j} = 1, \end{cases} \quad \forall r \in [\rho]_-, \forall j \in [N/2]_-.$$

//STEP 3: $\triangleright$ Decode the second outer-code using the updated channel model matrix, i.e.

$$\left[\hat{U}^{(1)}, \hat{X}^{(1)}, S^{(e)}_{1 \rightarrow}, \rho_{out}\right] = \text{SCLDecoder} \left(\left\{P^{(b)}\right\}_{b \in \{0,1\}}, \rho, \mathbf{z}_{N/2}^{N-1}\right).$$

$\triangleright$ Update $S^{(p)}$ and $s$ following (39).

Output:

• $\hat{U}_{r \rightarrow} = \left[\hat{U}^{(0)}_{S^{(p)}_{r,1} \rightarrow}, \hat{U}^{(1)}_{r \rightarrow}\right], \quad \forall r \in [\rho_{out}]_-;$

• $\hat{X}_{r,\text{even}} = \hat{X}^{(0)}_{S^{(p)}_{r,1} \rightarrow} + \hat{X}^{(1)}_{r \rightarrow}, \quad \hat{X}_{r,\text{odd}} = \hat{X}^{(1)}_{r \rightarrow}, \quad \forall r \in [\rho_{out}]_-;$

• $s$;

• $\rho_{out}$.

Here $\hat{X}_{r,\text{even}}$ ($\hat{X}_{r,\text{odd}}$) are the vectors of the even (odd) indices columns of row number $r$ in matrix $\hat{X}$.
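The two preparation steps of the recursive decoder (steps 0 and 2) can be sketched as follows. The sketch assumes the interleaved indexing $x_{2j} = u_j + v_j$, $x_{2j+1} = v_j$ implied by the even/odd output rule, and stores the likelihood matrices as nested lists. The function names `prepare_first` and `prepare_second` are hypothetical.

```python
def prepare_first(pi0, pi1, rho_in):
    """STEP 0: likelihoods of the first outer-code symbols, marginalizing over v."""
    half = len(pi0[0]) // 2
    P0 = [[0.5 * (pi0[r][2 * j] * pi0[r][2 * j + 1] + pi1[r][2 * j] * pi1[r][2 * j + 1])
           for j in range(half)] for r in range(rho_in)]
    P1 = [[0.5 * (pi1[r][2 * j] * pi0[r][2 * j + 1] + pi0[r][2 * j] * pi1[r][2 * j + 1])
           for j in range(half)] for r in range(rho_in)]
    return P0, P1


def prepare_second(pi0, pi1, s, x0_hat):
    """STEP 2: likelihoods of the second outer-code symbols given the first codeword.

    s[r] is the input model of path r; x0_hat[r][j] is the decided value of u_j.
    """
    half = len(pi0[0]) // 2
    P0, P1 = [], []
    for r, src in enumerate(s):
        row0, row1 = [], []
        for j in range(half):
            u = x0_hat[r][j]
            # v = 0 gives x_{2j} = u; v = 1 gives x_{2j} = u + 1.
            even_if_v0 = (pi1 if u else pi0)[src][2 * j]
            even_if_v1 = (pi0 if u else pi1)[src][2 * j]
            row0.append(0.5 * even_if_v0 * pi0[src][2 * j + 1])
            row1.append(0.5 * even_if_v1 * pi1[src][2 * j + 1])
        P0.append(row0)
        P1.append(row1)
    return P0, P1


pi0 = [[0.9, 0.8, 0.6, 0.3]]
pi1 = [[0.1, 0.2, 0.4, 0.7]]
A0, A1 = prepare_first(pi0, pi1, rho_in=1)
B0, B1 = prepare_second(pi0, pi1, s=[0], x0_hat=[[0, 1]])
```

The outputs of `prepare_first` feed the first recursive call, and the outputs of `prepare_second` feed the second one, each on a code of half the length.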
and \( \{\hat{X}^{(m)}\}_{m=0}^{i-1} \), updated and describing the possible paths that reach nodes at the top of the \( i \)th layer. Algorithm 12 elaborates on steps \( 2 \cdot i \) and \( 2 \cdot i + 1 \), which find the sequel to the \( L \) paths in layer \( i \) of the tree. The output generation of SCL is described in Algorithm 13.

Algorithm 12 SCL Decoding Steps Dedicated for Outer-Code \( C_i, \quad i \in [\ell] \).

// Let \( \rho \) be set to the number of active nodes at the top of layer \( i \). For \( i = 0 \) set \( \rho = \rho_{in} \).

**STEP 2 \( \cdot i \)**

▷ Using the decoding results of the outer-codewords from the previous steps, i.e. \( \hat{X}^{(m)} \) for \( m \in [i]_- \), prepare the \( N/\ell \) length likelihood lists, \( \{P^{(b)}\}_{b \in F} \). Each item in the list is an \( L \times N/\ell \) matrix, and all of them will serve as inputs to the decoder of the \( N/\ell \) length outer-code \#\( i \). For the computation of row \( r \) of \( P^{(b)} \), use the input statistical model \( s_r \), that is the likelihoods in rows \( \{\Pi^{(b)}_{s_r \rightarrow} \}_{b \in F} \).

\[
P^{(b)}_{r,j} = \sum_{x \in A^{(r,j,b)}} \Pi_{s_r, j \cdot \ell}^{(x_0)} \cdot \Pi_{s_r, j \cdot \ell+1}^{(x_1)} \cdot \Pi_{s_r, j \cdot \ell+2}^{(x_2)} \cdots \Pi_{s_r, (j+1) \cdot \ell-1}^{(x_{\ell-1})}, \quad \forall r \in [\rho]_-, \forall j \in [N/\ell]_-,
\]

where \( A^{(r,j,b)} \) is defined to be the set of all possible codewords \( x = g(v) \) of the inner-code (defined by the kernel \( g(\cdot) \)), having \( v_i = b \), and the prefix \( v_0^{i-1} \) defined by the \( r \)th decoding-path edge labels. Note that the \( r \)th decoding path nodes are \( \sigma = \left[ S^{(p)}_{r,0:(i-1)} , r \right] \) and their corresponding \( j \)th inner-code information prefix is \( v_0^{i-1} = \left[ \hat{X}^{(m)}_{\sigma_{m+1},j} \right]_{m=0}^{i-1} \). Consequently we have,

\[
A^{(r,j,b)} = \left\{ g(v) \,\middle|\, v_{i+1}^{\ell-1} \in F^{\ell-1-i} \wedge v_i = b \wedge v_{i-1} = \hat{X}^{(i-1)}_{r,j} \wedge v_m = \hat{X}^{(m)}_{S^{(p)}_{r,m+1},j}, \; m \in [i-1]_- \right\}.
\]

**STEP 2 \( \cdot i + 1 \)**

▷ SCL decode the \( i \)th outer-code using the updated channel model matrix, i.e.

\[
[\hat{U}^{(i)}, \hat{X}^{(i)}, S^{(e)}_{i \rightarrow}, \rho] = \text{SCLDecoder} \left( \left\{ P^{(b)} \right\}_{b \in F}, \rho, \mathbf{z}_{i \cdot N/\ell}^{(i+1) \cdot N/\ell - 1} \right).
\]

▷ Update \( S^{(p)} \) and \( s \) following (43).
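The preparation step for a general kernel amounts to summing products of channel likelihoods over all inner-code codewords that agree with the already decided prefix and with the hypothesis $v_i = b$. The sketch below illustrates this for one entry $P^{(b)}_{r,j}$. The function name `outer_code_likelihood`, the dictionary layout `pi[symbol][row][column]`, and the representation of the kernel as a Python callable are assumptions made for the illustration.

```python
from itertools import product


def outer_code_likelihood(pi, s_r, j, prefix, b, kernel, ell, field):
    """One entry of P^(b) for outer-code i = len(prefix).

    pi[x][row][col] is the likelihood of symbol x on channel col under
    input model row; prefix holds the decided symbols v_0..v_{i-1};
    kernel maps an ell-tuple v to the ell-tuple codeword g(v).
    """
    i = len(prefix)
    total = 0.0
    # Enumerate every assignment of the undecided suffix v_{i+1}..v_{ell-1}.
    for suffix in product(field, repeat=ell - 1 - i):
        x = kernel(tuple(prefix) + (b,) + suffix)
        term = 1.0
        for k in range(ell):
            term *= pi[x[k]][s_r][j * ell + k]
        total += term
    return total


# Check against the (u+v, v) kernel over the binary field.
arikan = lambda v: (v[0] ^ v[1], v[1])
pi = {0: [[0.9, 0.8]], 1: [[0.1, 0.2]]}
p_u0 = outer_code_likelihood(pi, 0, 0, (), 0, arikan, 2, (0, 1))
p_u1 = outer_code_likelihood(pi, 0, 0, (), 1, arikan, 2, (0, 1))
p_v0_given_u1 = outer_code_likelihood(pi, 0, 0, (1,), 0, arikan, 2, (0, 1))
```

The sum here is unnormalized, as in the displayed equation. The $(u + v, v)$ algorithms carry an additional constant factor of $\frac{1}{2}$, which scales all candidates of a step equally and therefore does not change their ranking.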

The decoder for the basic \( N = \ell \) length code also contains \( \ell \) pairs of steps. The procedure is similar to Algorithm 12. However, instead of delivering the likelihood matrices \( \{P^{(b)}\}_{b \in F} \) (here these matrices are actually column vectors) to an outer-code decoder, we concatenate them to a vector \( \hat{p} \) and choose the \( \rho = \min \{L, |F| \cdot \rho \} \) maximal elements from it. Following this selection we update the decoding path tracking structures. This is a generalization of the case of the \( N = 2 \) decoder in the \((u + v, v)\) construction.

In case the kernel is mixed, the generalization is also quite easy. Let us consider the mixed-kernels example from the end of Subsection 4.1. The only changes we have in the decoding algorithm are for the pair of steps in Algorithm 12 associated with the glued outer-code \( C_{1,2} \). In step 3 (the preparation step for this outer-code), we prepare \( |F|^2 \) input matrices \( D^{(b_1,b_2)} \), for all \((b_1, b_2) \in F^2\). In order to do this, we modify equations (41) and (42), replacing \( b \) with the pair \((b_1, b_2)\), corresponding to \( v_1 \) and \( v_2 \) in (12). The decoder of \( C_{1,2} \) is supposed to return a list of estimations of the information words, their corresponding codewords and the model indicator vectors. These outputs and the temporary structures are re-organized, as is done in step 2 \( \cdot r \) for the decoding algorithm of the homogeneous kernel polar code. Note, however, that at the end of step 3, there are three information words lists \( \hat{U}^{(0)}, \hat{U}^{(1)} \) and \( \hat{U}^{(2)} \) along
Algorithm 13 SCL Decoding Algorithm Output Generation

Output (occurs after applying Algorithm 12 for all $i \in [\ell]_-$):

• $\hat{U}_{r \to} = \begin{bmatrix} \hat{U}^{(0)}_{S^{(p)}_{r,1} \to} & \hat{U}^{(1)}_{S^{(p)}_{r,2} \to} & \cdots & \hat{U}^{(\ell-2)}_{S^{(p)}_{r,\ell-1} \to} & \hat{U}^{(\ell-1)}_{r \to} \end{bmatrix}$, $\forall r \in [\rho]_-$;
• $\hat{X}_{r,(j \cdot \ell):((j+1) \cdot \ell-1)} = g \left( \hat{X}^{(0)}_{S^{(p)}_{r,1},j}, \hat{X}^{(1)}_{S^{(p)}_{r,2},j}, \cdots, \hat{X}^{(\ell-2)}_{S^{(p)}_{r,\ell-1},j}, \hat{X}^{(\ell-1)}_{r,j} \right)$, $\forall r \in [\rho]_-$, $\forall j \in [N/\ell]_-$;
• $s$;
• $\rho_{out} = \rho$.

with their corresponding three outer-code codewords lists. This is because we have decoded $C_{1,2}$'s glued symbols simultaneously, which resulted in retrieving $\hat{U}^{(1)}$, $\hat{U}^{(2)}$, $\hat{X}^{(1)}$ and $\hat{X}^{(2)}$ in a single decoding step.

## 4.3 A Recursive Description of the BP Algorithm

BP is an iterative message-passing decoding algorithm in which messages are sent over Forney’s normal factor graph [32]. Although BP is an alternative to SC decoding [1], there is no evidence as to which algorithm performs better over general channels, except for the BEC, for which BP is shown to outperform SC [12]. Simulations, however, suggest that in many cases BP outperforms SC. On the other hand, SCL with small list size $L$ outperforms BP in many cases.

The order of sending the messages on the graph is called the schedule of the algorithm. Hussami et al. suggested employing the "Z shape schedule" for transferring the messages [12, Section II.A]. In this correspondence we introduce a serial schedule which is induced by the GCC structure of the code.

We begin by describing the types of messages that are computed throughout the algorithm for the $(u + v, v)$ polar code. Figure 5 depicts the normal factor graph representation of Arikan’s kernel. We have four symbol half edges denoted by $u, v, x_0$ and $x_1$. These symbols have the following functional dependencies among them: $x_0 = u + v$ and $x_1 = v$. The messages and the inputs that are sent on the graph are assumed to be LLRs, and their values are taken from $\mathbb{R} \cup \{-\infty, +\infty\}$. The $+\infty$ and $-\infty$ values are special types of LLR values that indicate a known assignment of 0 and 1, respectively. They are used to support the existence of the polar code’s frozen symbols.

We associate four input LLR messages with the symbol half edges. These messages may be generated by the output of the channel, by known values associated with frozen bits or by computations that were done in this iteration or previous ones. We represent these messages by $\mu^{(in)}_{u}$, $\mu^{(in)}_{v}$, $\mu^{(in)}_{x_0}$ and $\mu^{(in)}_{x_1}$. The algorithm computes four output LLR messages, $\mu^{(out)}_{u}$, $\mu^{(out)}_{v}$, $\mu^{(out)}_{x_0}$ and $\mu^{(out)}_{x_1}$, indicating the estimations of $u, v, x_0$ and $x_1$, respectively, by the decoding algorithm. The messages are computed according to the extrinsic information principle, i.e. each message that is sent from a node on an adjacent edge is a function of all the messages that were previously sent to the node, except the message that was received over the particular edge. The nodes of the graph are denoted by $a_0$ (the adder functional) and $e_1$ (the equality functional). Using the ideas mentioned above we have the following computation rules.

$$\mu_{e_1 \to a_0} = f_{(=)}\left(\mu^{(in)}_{x_1}, \mu^{(in)}_{v}\right), \quad (44)$$
$$\mu_{a_0 \to e_1} = f_{(+)}\left(\mu^{(in)}_{x_0}, \mu^{(in)}_{u}\right), \quad (45)$$
$$\mu^{(out)}_{u} = f_{(+)}\left(\mu^{(in)}_{x_0}, \mu_{e_1 \to a_0}\right), \quad (46)$$
$$\mu^{(out)}_{v} = f_{(=)}\left(\mu^{(in)}_{x_1}, \mu_{a_0 \to e_1}\right), \quad (47)$$
$$\mu^{(out)}_{x_0} = f_{(+)}\left(\mu^{(in)}_{u}, \mu_{e_1 \to a_0}\right), \quad (48)$$
$$\mu^{(out)}_{x_1} = f_{(=)}\left(\mu^{(in)}_{v}, \mu_{a_0 \to e_1}\right), \quad (49)$$

where \( f_{(=)}(z_0, z_1) \triangleq z_0 + z_1 \) and \( f_{(+)}(z_0, z_1) \triangleq 2 \tanh^{-1}(\tanh(z_0/2) \cdot \tanh(z_1/2)) \). We denote by \( \mu_{\alpha \rightarrow \beta} \), where \( \alpha, \beta \in \{e_1, a_0\} \), the message sent from node \( \alpha \) to node \( \beta \). \( \mu^{(\text{out})}_{u} \) and \( \mu^{(\text{out})}_{x_0} \) are sent from \( a_0 \) over the half edges corresponding to symbols \( u \) and \( x_0 \), respectively. \( \mu^{(\text{out})}_{v} \) and \( \mu^{(\text{out})}_{x_1} \) are sent from \( e_1 \) over the half edges corresponding to symbols \( v \) and \( x_1 \), respectively. Note that

\[ f_{(=)}(\pm \infty, z_1) = f_{(=)}(z_0, \pm \infty) = \pm \infty \]
\[ f_{(+)}(\pm \infty, z_1) = \pm z_1, \quad f_{(+)}(z_0, \pm \infty) = \pm z_0. \]
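The kernel computation rules can be sketched as follows, taking $e_1$ as the node enforcing $x_1 = v$ (sum of LLRs) and $a_0$ as the node enforcing $x_0 = u + v$ (the $\tanh$ rule). One implementation choice is an assumption of this sketch and not part of the text: the adder rule is evaluated through the algebraically equivalent form $\mathrm{sign}(z_0)\,\mathrm{sign}(z_1)\left(\min(|z_0|,|z_1|) + \ln(1 + e^{-(|z_0|+|z_1|)}) - \ln(1 + e^{-||z_0|-|z_1||})\right)$, which handles $\pm\infty$ without evaluating $\tanh^{-1}(1)$. The function names are hypothetical.

```python
import math


def f_eq(z0, z1):
    """Equality-node rule: LLRs of a repeated symbol add up.

    Contradicting inputs (+inf and -inf) are not handled and give nan.
    """
    return z0 + z1


def f_plus(z0, z1):
    """Adder-node rule, equal to 2*atanh(tanh(z0/2)*tanh(z1/2)) for finite inputs."""
    sign = math.copysign(1.0, z0) * math.copysign(1.0, z1)
    a, b = abs(z0), abs(z1)
    if math.isinf(a) and math.isinf(b):
        return sign * math.inf
    corr = math.log1p(math.exp(-(a + b))) - math.log1p(math.exp(-abs(a - b)))
    return sign * (min(a, b) + corr)


def kernel_messages(in_u, in_v, in_x0, in_x1):
    """Return (out_u, out_v, out_x0, out_x1) for one (u+v, v) kernel."""
    e1_to_a0 = f_eq(in_x1, in_v)      # (44)
    a0_to_e1 = f_plus(in_x0, in_u)    # (45)
    out_u = f_plus(in_x0, e1_to_a0)   # (46)
    out_v = f_eq(in_x1, a0_to_e1)     # (47)
    out_x0 = f_plus(in_u, e1_to_a0)   # (48)
    out_x1 = f_eq(in_v, a0_to_e1)     # (49)
    return out_u, out_v, out_x0, out_x1


# u frozen to 0 (LLR +inf), v unknown (LLR 0), channel LLRs 2.0 and -1.0.
out_u, out_v, out_x0, out_x1 = kernel_messages(math.inf, 0.0, 2.0, -1.0)
```

With $u$ known, the kernel behaves like a repetition of $v$ over both channels, so the estimate of $v$ is the sum of the two channel LLRs.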

We now turn to give a recursive description of an iteration of the algorithm. As depicted in Figure 4, the factor graph of the length \( N \) bits code has \( \log_2 N \) layers. In each layer, there exist \( N/2 \) copies of the kernel normal factor graph depicted in Figure 5. As a consequence, for each layer, we have \( N/2 \) realizations of each type of input messages, output messages and inner messages (each one corresponding to a different set of symbols and interconnect). To denote the \( i \)th realization of these messages, we use the notation \( \mu_{\alpha \rightarrow \beta,i} \), \( \mu_{\gamma,i}^{(\text{in})} \) and \( \mu_{\gamma,i}^{(\text{out})} \), where \( \alpha, \beta \in \{a_0, e_1\} \) and \( \gamma \in \{x_0, x_1, u, v\} \). As before, we denote the channel LLRs by the \( N \) length vector \( \lambda \). Each input message or inner message, unless given (by the channel output or by prior knowledge of the frozen bits), is set to 0 before the first iteration. It is assumed that the inner messages are preserved between the iterations (see the further discussion in the sequel).

Let us describe the BP decoder inputs and outputs for the \((u + v, v)\) code of length \( N \) bits. The inputs of the algorithm are the following.

- An \( N \) length vector of input LLRs, \( \lambda \), containing the observation from the channel.
- A pointer to a matrix \( M^{(u, \text{in})} \) of \( N \times \log_2(N) \) dimensions, which is used to hold the \( \mu^{(\text{in})}_{v} \) and \( \mu^{(\text{in})}_{u} \) messages between iterations. We employ a pointer here, because we would like to be able to change the values of the matrix as the algorithm progresses.
- A vector indicator \( z \in \{0, 1\}^N \), in which \( z_i = 1 \) if and only if the \( i \)th component of the information vector \( u \) is frozen.

The algorithm outputs the following structures.

- An \( N \) length binary vector \( \hat{u} \) containing the information word that the decoder estimated (including its frozen symbols).
- An \( N \) length vector \( \hat{x} \) containing the LLRs for the estimated codeword symbols. This structure is used to store the \( \mu^{(\text{out})}_{x_0} \) and \( \mu^{(\text{out})}_{x_1} \) messages.

The BP function signature is defined as follows

\[ [\hat{u}, \hat{x}] = \text{BPDecoder}(\lambda, M^{(u, \text{in})}, z). \]

Algorithm 14 outlines the BP iteration for length \( N > 2 \) code. Algorithm 15 completes this recursive description by considering the case of length \( N = 2 \) bits code. Note that in the algorithms we use aliases for several of our inputs in order to improve the procedure readability. We say that \( s \) is an alias for a variable \( w \) (and denote it by \( s \equiv w \)), if \( s \) is an alternative name for the memory space of \( w \), and therefore any algorithmic operation on \( w \) has the same results and side-effects as performing the operation on \( s \).

General schedules of BP may require holding a dedicated memory for storing the \( \mu^{(\text{in})}_{u}, \mu^{(\text{in})}_{v}, \mu^{(\text{in})}_{x_0}, \mu^{(\text{in})}_{x_1} \) and \( \mu_{a_0 \rightarrow e_1} \) types of messages that were previously computed. This memory may be needed for each realization of such messages, specifically, for each layer of the graph and for each \((u + v, v)\) normal subgraph, as in Figure 5. However, for our GCC schedule, excluding \( \mu^{(\text{in})}_{v} \), we do not need to save any message beyond the
Algorithm 14 BP Decoder of Length \( N = 2^n \) Bits \((u + v, v)\) Polar Code


Input: \( \lambda \); \( M^{(u, \text{in})} \); \( \mathbf{z} \).

//Initializations:
▷ We use the following aliasing for the inputs of the algorithm.

\[
\mu^{(\text{in})}_{x_0,r} \equiv \lambda_{2r} \quad \text{and} \quad \mu^{(\text{in})}_{x_1,r} \equiv \lambda_{2r+1}, \quad \forall r \in [N/2]_-;
\]

\[
\mu^{(\text{in})}_{u,r} \equiv M^{(u, \text{in})}_{2r,0} \quad \text{and} \quad \mu^{(\text{in})}_{v,r} \equiv M^{(u, \text{in})}_{2r+1,0}, \quad \forall r \in [N/2]_-.
\]

//STEP 0:
▷ Compute the messages \( \left[\mu_{e_1 \rightarrow a_0,r}\right]_{r=0}^{N/2-1} \) using (44).
▷ Compute the messages \( \left[\mu^{(\text{out})}_{u,r}\right]_{r=0}^{N/2-1} \) using (46).

//STEP 1:
▷ Perform an iteration on the first outer-code: give the vector \( \left[\mu^{(\text{out})}_{u,r}\right]_{r=0}^{N/2-1} \) as an input to the polar code BP iterative decoder of length \( N/2 \) bits. Also provide the indices of the frozen bits from the first half of the codeword. The decoder outputs an estimation of the first outer-code codeword to \( \left[\mu^{(\text{in})}_{u,r}\right]_{r=0}^{N/2-1} \) (manifested as LLRs) and an estimation of its information word to the binary vector \( \hat{\mathbf{u}}^{(0)} \), i.e.

\[
\left[\hat{\mathbf{u}}^{(0)}, \left[\mu^{(\text{in})}_{u,r}\right]_{r=0}^{N/2-1}\right] = \text{BPDecoder}\left(\left[\mu^{(\text{out})}_{u,r}\right]_{r=0}^{N/2-1}, M^{(u, \text{in})}_{0:(N/2-1), 1:(\log_2(N)-1)}, \mathbf{z}_{0}^{N/2-1}\right). \quad (53)
\]

//STEP 2:
▷ Compute the messages \( \left[\mu_{a_0 \rightarrow e_1,r}\right]_{r=0}^{N/2-1} \) using (45).
▷ Compute the messages \( \left[\mu^{(\text{out})}_{v,r}\right]_{r=0}^{N/2-1} \) using (47). (Note that these two steps can be combined into one computation.)

//STEP 3:
▷ Perform an iteration on the second outer-code: give the vector \( \left[\mu^{(\text{out})}_{v,r}\right]_{r=0}^{N/2-1} \) as an input to the polar code BP iterative decoder of length \( N/2 \) bits. Also provide the indices of the frozen bits from the second half of the codeword. The decoder outputs an estimation of the second outer-code codeword to \( \left[\mu^{(\text{in})}_{v,r}\right]_{r=0}^{N/2-1} \) (manifested as LLRs) and an estimation of its information word to the binary vector \( \hat{\mathbf{u}}^{(1)} \), i.e.

\[
\left[\hat{\mathbf{u}}^{(1)}, \left[\mu^{(\text{in})}_{v,r}\right]_{r=0}^{N/2-1}\right] = \text{BPDecoder}\left(\left[\mu^{(\text{out})}_{v,r}\right]_{r=0}^{N/2-1}, M^{(u, \text{in})}_{N/2:(N-1), 1:(\log_2(N)-1)}, \mathbf{z}_{N/2}^{N-1}\right). \quad (54)
\]

▷ Compute the messages \( \left[\mu_{e_1 \rightarrow a_0,r}\right]_{r=0}^{N/2-1} \) using (44).
▷ Compute the messages \( \left[\mu^{(\text{out})}_{x_0,r}\right]_{r=0}^{N/2-1} \) and \( \left[\mu^{(\text{out})}_{x_1,r}\right]_{r=0}^{N/2-1} \) using (48) and (49), respectively.

Output:
• \( \hat{\mathbf{u}} = [\hat{\mathbf{u}}^{(0)}, \hat{\mathbf{u}}^{(1)}] \);
• \( \hat{x}_{2r} = \mu^{(\text{out})}_{x_0,r} \); \( \hat{x}_{2r+1} = \mu^{(\text{out})}_{x_1,r} \), \( \forall r \in [N/2]_- \).
Algorithm 15 BP Decoder for Length $N = 2$ Bits $(u + v, v)$ Polar Code

Input: $\lambda; M^{(u,in)}; z$.

//Initializations:
▷ We use the following aliasing for the inputs of the algorithm.

\[
\mu_{x_0}^{(in)} \equiv \lambda_0 \quad \text{and} \quad \mu_{x_1}^{(in)} \equiv \lambda_1;
\]

▷ Initialize the $u, v$ input LLR messages (we assume that frozen variables are fixed to the zero value)

\[
\mu_w^{(in)} = \begin{cases} 
0, & w \text{ is not frozen}; \\
\infty, & w \text{ is frozen.}
\end{cases} \quad \forall w \in \{u, v\}. \tag{55}
\]

//STEP 0:
▷ Compute $\mu_{e_1 \rightarrow a_0}$ according to (44).

//STEP 1:
▷ If $u$ is not frozen, compute $\mu_{u}^{(out)}$ according to (46), and make a hard decision on this bit, based on its sign (denote it by $\hat{u}$). Otherwise, $\hat{u} = 0$.

//STEP 2:
▷ Compute $\mu_{a_0 \rightarrow e_1}$ according to (45).

//STEP 3:
▷ If $v$ is not frozen, compute $\mu_{v}^{(out)}$ according to (47), and make a hard decision on it, based on its sign (denote it by $\hat{v}$). Otherwise, $\hat{v} = 0$.

▷ Compute $\mu_{x_0}^{(out)}$ and $\mu_{x_1}^{(out)}$ according to (48) and (49), respectively.

Output:
• $\hat{\mathbf{u}} = [\hat{u}, \hat{v}]$;
• $\hat{x} = [\mu_{x_0}^{(out)}, \mu_{x_1}^{(out)}]$.
iteration boundary. This is because in each iteration, all the messages except $\mu^{(in)}_v$ are re-computed before their first usage (in the iteration). The implication of this observation is that the required memory consumption can be reduced (see Subsection 5.1.6). Furthermore, the memory used for the other messages is temporary and needed only for the same iteration. It can be seen that the memory for these temporary messages is linear in the block length. The requirement to keep all the $\mu^{(in)}_v$ type of messages beyond the iteration boundary of the algorithm results in memory consumption of $\Theta (N \cdot \log(N))$.

In each iteration, we send one instance for each of the possible messages and for each $(u + v,v)$ block realization in the code, except for the $\mu_{e_1 \rightarrow a_0}$ type of message for which we send two messages (for all the layers, besides the last one). Consequently the iteration time complexity is $\Theta (N \cdot \log(N))$.
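A single iteration of the GCC schedule, following Algorithms 14 and 15, can be sketched recursively. This is an illustration under several assumptions: the matrix $M^{(u,\text{in})}$ is a nested list shared by all recursion levels and addressed through the hypothetical `row` and `col` offsets (standing in for the pointer to a sub-matrix), the length-2 base case re-derives its $u, v$ input messages from the frozen indicator rather than reading them from the matrix, and the adder rule uses the same algebraically equivalent form of the $\tanh$ rule that tolerates $\pm\infty$.

```python
import math

INF = math.inf


def f_plus(z0, z1):
    """Adder-node rule, equal to 2*atanh(tanh(z0/2)*tanh(z1/2)) for finite inputs."""
    sign = math.copysign(1.0, z0) * math.copysign(1.0, z1)
    a, b = abs(z0), abs(z1)
    if math.isinf(a) and math.isinf(b):
        return sign * INF
    corr = math.log1p(math.exp(-(a + b))) - math.log1p(math.exp(-abs(a - b)))
    return sign * (min(a, b) + corr)


def bp_iteration(lam, M, z, row=0, col=0):
    """One BP iteration; returns (u_hat, x_hat) and updates M in place."""
    n = len(lam)
    if n == 2:  # base case, frozen symbols assumed to be 0
        in_u = INF if z[0] else 0.0
        in_v = INF if z[1] else 0.0
        e2a = lam[1] + in_v                                  # (44)
        out_u = f_plus(lam[0], e2a)                          # (46)
        a2e = f_plus(lam[0], in_u)                           # (45)
        out_v = lam[1] + a2e                                 # (47)
        u_hat = [0 if z[0] or out_u >= 0 else 1,
                 0 if z[1] or out_v >= 0 else 1]
        return u_hat, [f_plus(in_u, e2a), in_v + a2e]        # (48), (49)
    half = n // 2
    iu = [row + 2 * r for r in range(half)]      # rows aliasing mu_in_u
    iv = [row + 2 * r + 1 for r in range(half)]  # rows aliasing mu_in_v
    # STEP 0 and STEP 1: first outer-code.
    e2a = [lam[2 * r + 1] + M[iv[r]][col] for r in range(half)]
    out_u = [f_plus(lam[2 * r], e2a[r]) for r in range(half)]
    u0, x0 = bp_iteration(out_u, M, z[:half], row, col + 1)
    for r in range(half):
        M[iu[r]][col] = x0[r]
    # STEP 2 and STEP 3: second outer-code.
    a2e = [f_plus(lam[2 * r], M[iu[r]][col]) for r in range(half)]
    out_v = [lam[2 * r + 1] + a2e[r] for r in range(half)]
    u1, x1 = bp_iteration(out_v, M, z[half:], row + half, col + 1)
    x_hat = [0.0] * n
    for r in range(half):
        M[iv[r]][col] = x1[r]
        e2a_new = lam[2 * r + 1] + M[iv[r]][col]             # (44) again
        x_hat[2 * r] = f_plus(M[iu[r]][col], e2a_new)        # (48)
        x_hat[2 * r + 1] = M[iv[r]][col] + a2e[r]            # (49)
    return u0 + u1, x_hat


# Length 4, only the last information symbol is unfrozen.
M = [[0.0, 0.0] for _ in range(4)]
u_hat, x_hat = bp_iteration([-1.0, -2.0, -1.5, -3.0], M, [1, 1, 1, 0])
```

In this example all channel LLRs favor the value 1, and the only codeword consistent with the frozen symbols besides the all-zero word is the all-one word, so the sketch decides that the unfrozen symbol equals 1. Calling `bp_iteration` repeatedly with the same `M` corresponds to running several iterations.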

A complete BP implementation may require several iterations. The number of iterations may be fixed or set adaptively, which means that the algorithm continues until some consistency constraints are satisfied. An example of such a constraint is that the signs of the LLR estimations for all the frozen bits agree with their known values (i.e. if all the frozen bits are set to zero, then $\mu^{(out)}_w > 0$ for all the frozen bits, $w$). In this case, it is possible to stop an iteration in the middle by keeping a counter, in a similar way to the method that is usually used in BP decoding of LDPC codes using the check-node based serial schedules (see e.g. [33]). We note, however, that in the LDPC case, the consistency is manifested by the fact that all the parity check equations are satisfied.

Let us now consider BP for polar codes with general kernels. For this description we require the kernels to be linear and be represented by a lower triangular generating matrix. The input to the kernel and the output of the kernel are $\ell$ length vectors $u$ and $x \in \mathbb{F}^\ell$, respectively, satisfying $x = g(u) = u \cdot G$, where $G$ is an $\ell \times \ell$ lower triangular generating matrix. Figure 10a depicts a normal factor graph for such an $\ell$ dimensions binary kernel. We have that an edge $e_i \rightarrow a_j$ exists in the graph if and only if $G_{i,j} \neq 0$. Being a lower triangular matrix means that in the factor graph there are no edges $e_i \rightarrow a_j$, such that $j > i$. In case the kernel is non-binary, each edge $e_i \rightarrow a_j$ also has a label equal to its $G_{i,j}$ value. For example, Figure 10b depicts the normal factor graph corresponding to the $RS3$ kernel, with generating matrix

$$G_{RS3} = \begin{bmatrix} 1 & 0 & 0 \\ 1 & 1 & 0 \\ \alpha^2 & \alpha & 1 \end{bmatrix}.$$  

Similarly to the discussion on the $(u + v,v)$ code, we define input messages to the factor graph denoted by $\mu^{(in)}_{u_i(t)}$ and $\mu^{(in)}_{x_j(t)}$ and their corresponding output messages $\mu^{(out)}_{u_i(t)}$ and $\mu^{(out)}_{x_j(t)}$ where $i,j \in [\ell]_-$ and $t \in F\setminus\{0\}$. Moreover, we have messages $\mu^{(t)}_{e_i \rightarrow a_j}$ and $\mu^{(t)}_{a_j \rightarrow e_i}$ for every edge $e_i \rightarrow a_j$. All of these messages are LLRs such that for a message $\mu^{(t)}$ we have $\mu^{(t)} = \ln \left( \frac{\Pr(y|\omega=0)}{\Pr(y|\omega=t)} \right)$, where $y$ is a vector of observations, and $\omega$ is the variable associated with the edge on which the message $\mu^{(t)}$ is transmitted. In case the code is binary the letter indication $t$ may be omitted. All the messages are calculated using the (typically false) assumption that the factor graph is cycle free, and consequently for each node the messages sent to it are statistically independent. Let $a_j$ be an adder node corresponding to column $j$ of the generating matrix $G$, such that $\sum_{i=0}^{\ell-1} G_{i,j} \cdot u_i = x_j$. We have for $i, j \in [\ell]_-$ and $i \geq j$

$$
\mu^{(t)}_{e_i \rightarrow a_j} = f_{(=)} \left( t, \left[ \mu^{(\tau)}_{a_{j'} \rightarrow e_i} \right]_{\tau \in F \setminus \{0\},\, j' \neq j}, \left[ \mu^{(in)}_{u_i(\tau)} \right]_{\tau \in F \setminus \{0\}} \right);
$$

$$
\mu^{(t)}_{a_j \rightarrow e_i} = f_{(+)} \left( t, \left[ \mu^{(\tau)}_{e_{i'} \rightarrow a_j} \right]_{\tau \in F \setminus \{0\},\, i' \neq i}, \left[ \mu^{(in)}_{x_j(\tau)} \right]_{\tau \in F \setminus \{0\}} \right);
$$

$$
\mu^{(out)}_{u_i(t)} = f_{(=)} \left( t, \left[ \mu^{(\tau)}_{a_{j'} \rightarrow e_i} \right]_{\tau \in F \setminus \{0\},\, j'} \right);
$$

$$
\mu^{(out)}_{x_j(t)} = f_{(+)} \left( t, \left[ \mu^{(\tau)}_{e_{i'} \rightarrow a_j} \right]_{\tau \in F \setminus \{0\},\, i'} \right).
$$

Here $j'$ (respectively $i'$) ranges over the indices of the nodes adjacent to $e_i$ (respectively $a_j$) in the factor graph.

The functions $f_{(=)}(\cdot)$ and $f_{(+)}(\cdot)$ are generalizations of the functions that were presented before for the $(u+v,v)$ case. Note that for these functions, the number of input arguments that follow $t$ (the alphabet symbol) may vary. This number is equal to the degree of the sending node minus one. Consequently, we denoted them in (57)-(60) as a vector of vectors (i.e. using the \( \left[ \mu^{(\tau)} \right]_{\tau \in F \setminus \{0\}} \) notation); however, it should be understood that each element of this vector of vectors is a different argument to the functions.

Given an equality node $e_i$ with degree $d$, the function $f_{(=)}(\cdot)$ is defined as follows

$$
f_{(=)} \left( t, \left[ \mu^{(\tau)}_{0} \right]_{\tau \in F \setminus \{0\}}, \left[ \mu^{(\tau)}_{1} \right]_{\tau \in F \setminus \{0\}}, \ldots, \left[ \mu^{(\tau)}_{d-2} \right]_{\tau \in F \setminus \{0\}} \right) \triangleq \sum_{r=0}^{d-2} \mu^{(t)}_{r}, \tag{61}
$$

where $t \in F \setminus \{0\}$ and \( \left[ \mu^{(\tau)}_{r} \right]_{\tau \in F \setminus \{0\}} \), $r \in [d-1]_-$, are LLR messages received at the node from the $d-1$ edges adjacent to it. Denote the variables associated with these edges by $\omega_r$ for $r \in [d-1]_-$ and let $\omega$ denote the variable associated with the edge whose messages were not given as input (we refer to this edge as the "missing edge"). The output of this function is the LLR message sent from $e_i$ on the missing edge. Because the node imposes a repetition constraint ($\omega = \omega_r$ for all $r \in [d-1]_-$), the LLR calculated in (61) is the sum of the incoming LLRs corresponding to the same alphabet letter $t$. See Figure 11a for an illustration of this case.
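
A minimal Python sketch of the equality-node rule follows. It assumes that each incoming message is stored as a dictionary mapping a non-zero alphabet letter $\tau$ to its LLR; the function name `f_eq` and this data layout are illustrative choices and are not taken from the text.

```python
def f_eq(t, *incoming):
    """Equality-node update: LLR for letter t sent on the missing edge.

    incoming -- one dict {tau: LLR} per known edge (d - 1 dicts in total).
    Under the repetition constraint the LLRs of the same letter add up.
    """
    return sum(msg[t] for msg in incoming)


# Degree-4 node over a quaternary alphabet with letters labelled 1, 2, 3.
m0 = {1: 0.5, 2: -1.0, 3: 0.25}
m1 = {1: -1.25, 2: 2.0, 3: 0.5}
m2 = {1: 2.0, 2: 0.5, 3: -0.75}
out = {t: f_eq(t, m0, m1, m2) for t in (1, 2, 3)}
```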

Given an adder node $a_j$ with degree $d$, the function $f_{(+)}(\cdot)$ is defined as follows

$$
f_{(+)} \left( t, \left[ \mu^{(\tau)}_{0} \right]_{\tau \in F \setminus \{0\}}, \ldots, \left[ \mu^{(\tau)}_{d-2} \right]_{\tau \in F \setminus \{0\}} \right) \triangleq \ln \left( \frac{\sum_{\omega_0^{d-2} \in A(\gamma, 0)} \exp \left\{ -\sum_{r=0}^{d-2} \mu^{(\omega_r)}_{r} \right\}}{\sum_{\omega_0^{d-2} \in A(\gamma, t)} \exp \left\{ -\sum_{r=0}^{d-2} \mu^{(\omega_r)}_{r} \right\}} \right), \tag{62}
$$

where $t \in F \setminus \{0\}$, $\mu^{(0)}_{r} = 0$ and \( \left[ \mu^{(\tau)}_{r} \right]_{\tau \in F \setminus \{0\}} \) are LLR messages received at the node from the $d-1$ edges adjacent to it, $r \in [d-1]_-$. Let us denote the variables corresponding to these edges by $\{\omega_r\}_{r=0}^{d-2}$ and the variable corresponding to the missing edge by $\omega$. It is assumed that the generating matrix equation corresponding to node $a_j$ is

$$
\omega = \sum_{r=0}^{d-2} \gamma_r \cdot \omega_r. \tag{63}
$$

Consequently, the set $A(\gamma, t)$ is defined as the set of all assignments to $\omega_0^{d-2}$ such that $\omega = t$ in (63), i.e. $A(\gamma, t) = \left\{ \omega_0^{d-2} \,\big|\, \sum_{r=0}^{d-2} \gamma_r \cdot \omega_r = t \right\}$. See Figure 11b for an illustration of this case. Note that a naive calculation of (62) for all $t \in F \setminus \{0\}$ has time complexity of $O(|F|^{d})$, while a computation that uses a trellis has complexity of $O(d \cdot |F|^2)$.
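
The naive evaluation of the adder-node rule can be sketched in Python as below. The sketch assumes the convention $\mu^{(t)} = \ln(\Pr(y|\omega=0)/\Pr(y|\omega=t))$ with $\mu^{(0)} = 0$, encodes field elements as the integers $0, \ldots, |F|-1$, and takes the field addition and multiplication as callables; the name `f_plus`, this encoding and the GF(4) table are assumptions of the example. It enumerates every assignment of the known edges, so it illustrates the exponential cost of the naive approach rather than the trellis computation.

```python
import itertools
import math


def f_plus(t, gammas, incoming, q, add, mul):
    """Naive adder-node update: LLR of omega = t on the missing edge.

    gammas   -- coefficients gamma_r in omega = sum_r gamma_r * omega_r
    incoming -- one dict {tau: LLR} per known edge, tau in 1..q-1
    q        -- field size; elements are encoded as 0..q-1 (assumption)
    add, mul -- field addition and multiplication on that encoding
    """
    def weight_of(target):
        total = 0.0
        for assignment in itertools.product(range(q), repeat=len(incoming)):
            omega = 0
            for g, w in zip(gammas, assignment):
                omega = add(omega, mul(g, w))
            if omega != target:
                continue
            # The LLR of the letter 0 is 0 by convention, hence the default.
            exponent = sum(m.get(w, 0.0) for m, w in zip(incoming, assignment))
            total += math.exp(-exponent)
        return total

    return math.log(weight_of(0) / weight_of(t))


# GF(4) with 0, 1, alpha, alpha^2 encoded as 0, 1, 2, 3 (assumption).
GF4_MUL = [[0, 0, 0, 0], [0, 1, 2, 3], [0, 2, 3, 1], [0, 3, 1, 2]]


def gf4_add(a, b):
    return a ^ b


def gf4_mul(a, b):
    return GF4_MUL[a][b]


msgs = [{1: 0.3, 2: -0.2, 3: 1.1}, {1: 0.0, 2: 0.7, 3: -0.4}]
llr_alpha = f_plus(2, [2, 1], msgs, 4, gf4_add, gf4_mul)
```

In the binary case with $\gamma_0 = \gamma_1 = 1$ the same routine reduces to the familiar $2 \tanh^{-1}(\tanh(\mu_0/2) \tanh(\mu_1/2))$ rule of the $(u+v,v)$ kernel.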

We are now ready to describe the BP algorithm for general lower triangular kernels. The inputs of the algorithm are as follows.

- $N$ length vectors of input LLRs, $\{\lambda(t)\}_{t \in F \setminus \{0\}}$, containing the observations of the channel ($t$ indicates the code alphabet letters).

- Pointers to matrices $\left\{ \{M^{(u_i, in)(t)}\}_{t \in F \setminus \{0\}} \right\}_{i \in [\ell]_-}$ of $N/\ell \times \log_\ell(N)$ dimensions, which are used to hold the $\mu_{u_i}^{(in)(t)}$ messages between iterations. Pointers are employed here, because we would like to be able to change the values of the matrices as the algorithm progresses.

- Pointers to matrices $\left\{ \{M^{(a_j \rightarrow e_i)(t)}\}_{t \in F \setminus \{0\}} \right\}_{0 \leq j \leq i \leq \ell - 1}$ of $N/\ell \times \log_\ell(N)$ dimensions, which are used to hold the $\mu_{a_j \rightarrow e_i}^{(t)}$ messages between iterations.

- Vector indicator $z \in \{0, 1\}^N$, in which $z_i = 1$ if and only if the $i^{th}$ component of the information vector $u$ is frozen.

The algorithm outputs the following structures.

- An $N$ length vector $\hat{u} \in F^N$, containing the information vector that the decoder estimated (including its frozen symbols).

- $N$ length vectors $\{\hat{x}(t)\}_{t \in F \setminus \{0\}}$ which are the LLRs for the estimated codeword. This structure is used for delivering the $\mu_{x_j}^{(out)}(t)$ messages.

The BP function signature is defined as follows (note that the $i, j$ in the third argument are limited such that $0 \leq j \leq i \leq \ell - 1$)

$$\left[ \hat{u}, \{\hat{x}(t)\}_{t \in F \setminus \{0\}} \right] = \text{BPDecoder} \left( \left\{ \lambda(t) \right\}_{t \in F \setminus \{0\}}, \left\{ \left\{ M^{(u_i, in)(t)} \right\}_{t \in F \setminus \{0\}} \right\}_{i}, \left\{ \left\{ M^{(a_j \rightarrow e_i)(t)} \right\}_{t \in F \setminus \{0\}} \right\}_{i,j}, z \right). \tag{64}$$

We start with the $N = \ell$ symbols case. Algorithm 16 gives the description for this case. Algorithm 17 considers the $N > \ell$ symbols case.

Algorithm 16 BP Decoder for Length $N = \ell$ Symbols Polar Code

Input: $\left\{ \lambda^{(t)} \right\}_{t \in F \setminus \{0\}}$; $\left\{ \left\{ M^{(u_i, \text{in})(t)} \right\}_{t \in F \setminus \{0\}} \right\}_{i}$; $\left\{ \left\{ M^{(a_j \rightarrow e_i)(t)} \right\}_{t \in F \setminus \{0\}} \right\}_{i,j}$; $z$.

//Initializations:
▷ Use the following aliases to the inputs of the algorithm.
\[
\mu^{(\text{in})(t)}_{x_j} \equiv \lambda^{(t)}_{j}, \quad \forall t \in F \setminus \{0\}, \forall j \in [\ell]_-;
\]
\[
\mu^{(t)}_{a_j \rightarrow e_i} \equiv M^{(a_j \rightarrow e_i)(t)}_{0,0}, \quad \forall t \in F \setminus \{0\}, \forall 0 \leq j \leq i \leq \ell - 1.
\]
▷ Initialize the vectors $\left[ \mu^{(\text{in})(t)}_{u_i} \right]_{t \in F \setminus \{0\}}$:
\[
\mu^{(\text{in})(t)}_{u_i} = \begin{cases} 0, & z_i = 0; \\ \infty, & z_i \neq 0, \end{cases} \quad \forall i \in [\ell]_-, \forall t \in F \setminus \{0\}.
\]

//Iteration:
▷ For $j = \ell - 1$ to $0$ Do
  • Compute $\left[ \mu^{(t)}_{e_i \rightarrow a_j} \right]_{t \in F \setminus \{0\}}$, $\forall i$, s.t. $j < i \leq \ell - 1$ using (57);
  • Compute $\left[ \mu^{(t)}_{a_j \rightarrow e_i} \right]_{t \in F \setminus \{0\}}$ using (58).
▷ For $i = 0$ to $\ell - 1$ Do
  • Compute $\left[ \mu^{(t)}_{a_j \rightarrow e_i} \right]_{t \in F \setminus \{0\}}$, $\forall j$, s.t. $0 \leq j < i$ using (58);
  • If $u_i$ is not frozen, compute $\left[ \mu^{(\text{out})(t)}_{u_i} \right]_{t \in F \setminus \{0\}}$ according to (59), and make a hard decision on this symbol based on the LLR vector (denote the hard decision by $\hat{u}_i$). If $u_i$ is frozen, set $\hat{u}_i = 0$;
  • Compute $\left[ \mu^{(t)}_{e_i \rightarrow a_j} \right]_{t \in F \setminus \{0\}}$, $\forall j$, s.t. $0 \leq j < i$ using (57).
▷ Compute $\left[ \mu^{(\text{out})(t)}_{x_j} \right]_{t \in F \setminus \{0\}}$, $\forall j \in [\ell]_-$ using (60).

//Output:
▷ $\hat{u} = [\hat{u}_0, \hat{u}_1, \ldots, \hat{u}_{\ell - 1}]$;
▷ $\hat{x} = \left[ \left[ \mu^{(\text{out})(t)}_{x_0} \right]_{t \in F \setminus \{0\}}, \left[ \mu^{(\text{out})(t)}_{x_1} \right]_{t \in F \setminus \{0\}}, \ldots, \left[ \mu^{(\text{out})(t)}_{x_{\ell - 1}} \right]_{t \in F \setminus \{0\}} \right]$.

Algorithm 17 BP Decoder of Length $N = \ell^n$ F-Symbols Polar Code

Input: $\{\lambda^{(t)}\}_{t \in F \setminus \{0\}}$; $\{\{M^{(u_i, in)(t)}\}_{t \in F \setminus \{0\}}\}_{i}$; $\{\{M^{(a_j \rightarrow e_i)(t)}\}_{t \in F \setminus \{0\}}\}_{i,j}$; $z$.

//Initializations:
▷ Use the following aliases to the inputs of the algorithm.
\[
\mu^{(in)(t)}_{x_j,r} := \lambda^{(t)}_{r \cdot \ell + j}, \quad \forall t \in F \setminus \{0\}, \forall j \in [\ell]_- \text{ and } \forall r \in [N/\ell]_-;
\]
\[
\mu^{(in)(t)}_{u_i,r} := M^{(u_i, in)(t)}_{r,0}, \quad \forall t \in F \setminus \{0\}, \forall i \in [\ell]_- \text{ and } \forall r \in [N/\ell]_-;
\]
\[
\mu^{(t)}_{a_j \rightarrow e_i,r} := M^{(a_j \rightarrow e_i)(t)}_{r,0}, \quad \forall t \in F \setminus \{0\}, \forall 0 \leq j \leq i \leq \ell - 1 \text{ and } \forall r \in [N/\ell]_-.
\]

//Iteration:
▷ For $j = \ell - 1$ to $0$ Do
  • Compute $[\mu^{(t)}_{e_i \rightarrow a_j,r}]_{t \in F \setminus \{0\}}$, $\forall i$, s.t. $j < i \leq \ell - 1$ and $\forall r \in [N/\ell]_-$ using (57);
  • Compute $[\mu^{(t)}_{a_j \rightarrow e_i,r}]_{t \in F \setminus \{0\}}$, $\forall r \in [N/\ell]_-$ using (58).
▷ For $i = 0$ to $\ell - 1$ Do
  • Run steps $2 \cdot i$ and $2 \cdot i + 1$ of Algorithm 18.
▷ Compute $[\mu^{(\text{out})(t)}_{x_j,r}]_{t \in F \setminus \{0\}}$, $\forall j \in [\ell]_-$ and $\forall r \in [N/\ell]_-$ using (60).

Output:
  • $\hat{u} = [\hat{u}^{(0)}, \hat{u}^{(1)}, \ldots, \hat{u}^{(\ell-1)}]$;
  • $\hat{x}^{(t)}_{r \cdot \ell + j} = \mu^{(\text{out})(t)}_{x_j,r}$, $\forall j \in [\ell]_-$, $\forall t \in F \setminus \{0\}$ and $\forall r \in [N/\ell]_-$.

Algorithm 18 BP Iterations Steps Dedicated for Decoding of Outer-Code $C_i$, $i \in [\ell]_-$

//STEP 2 · i:

\(\triangleright\) Compute \(\left[\mu_{a_j \rightarrow e_i, r}^{(t)}\right]_{t \in F \setminus \{0\}}\), \(\forall j\), s.t. \(0 \leq j < i\) and \(\forall r \in [N/\ell]_-\) using (58).

\(\triangleright\) Compute \(\left[\mu_{u_i, r}^{(out)(t)}\right]_{t \in F \setminus \{0\}}\), \(\forall r \in [N/\ell]_-\) according to (59).

//STEP 2 · i + 1:

\(\triangleright\) Give the vectors \(\left\{ \left[ \mu_{u_i, r}^{(out)(t)} \right]_{r=0}^{N/\ell - 1} \right\}_{t \in F \setminus \{0\}}\) as an input to the polar code decoder of length $N/\ell$ symbols. Also provide to this decoder the indices of the frozen symbols corresponding to $C_i$ and pointers to the matrices containing the messages of this outer-code. Assume that the decoder outputs

\[
\left[ \hat{u}^{(i)}, \left\{ \left[ \mu_{u_i, r}^{(in)(t)} \right]_{r=0}^{N/\ell - 1} \right\}_{t \in F \setminus \{0\}} \right] = \text{BPDecoder} \Big( \left\{ \left[ \mu_{u_i, r}^{(out)(t)} \right]_{r=0}^{N/\ell - 1} \right\}_{t \in F \setminus \{0\}},
\]

\[
\left\{ \left\{ M^{(u_{i'}, in)(t)}_{i \cdot N/\ell^2 : (i+1) \cdot N/\ell^2 - 1,\; 1 : (\log_\ell N - 1)} \right\}_{t \in F \setminus \{0\}} \right\}_{i'}, \quad \left\{ \left\{ M^{(a_{j'} \rightarrow e_{i'})(t)}_{i \cdot N/\ell^2 : (i+1) \cdot N/\ell^2 - 1,\; 1 : (\log_\ell N - 1)} \right\}_{t \in F \setminus \{0\}} \right\}_{i', j'},
\]

\[
z_{i \cdot N/\ell}^{(i+1) \cdot N/\ell - 1} \Big). \tag{65}
\]

// Note that \(i', j'\) in the third argument of (65) are limited such that \(0 \leq j' \leq i' \leq \ell - 1\).

\(\triangleright\) Compute \(\left[\mu_{e_i \rightarrow a_j, r}^{(t)}\right]_{t \in F \setminus \{0\}}\), \(\forall j\), s.t. \(0 \leq j < i\) and \(\forall r \in [N/\ell]_-\) using (57).

Thus far we discussed homogeneous kernels. BP on mixed-kernels polar codes can be defined in a similar manner. In mixed-kernels structures we have at least two types of constituent kernels, each one with a different alphabet. In order to connect these kernels, we combine several input symbols of the first kernel and consider them as a single entity for decoding purposes. We say that these symbols are "glued" together, thereby creating a symbol of the larger-alphabet kernel. The output symbols of the larger-alphabet kernel are given as input to the glued input entry of the inner mapping defined by the first kernel. In order to support this gluing operation we introduce an additional node to the normal factor graph, and label it by the '&' symbol. This node serves as a "bridge" between the two alphabets.

Example 4 (BP on Mixed-Kernels) Let us consider the mixed-kernels code discussed in Example 3. In this example we use the binary matrix $G = \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix}^{\otimes 2}$ as our first kernel and glue its input components $u_1, u_2 \in GF(2)$ into one entity called $u_{(1,2)} \in GF(4)$. The second kernel is the RS4 kernel described by the generating matrix (??). All the BP messages sent over the edges of the first kernel and the RS4 kernel were already discussed above, except the ones sent and received by the $\&_{(1,2)}$ node. Note that the correspondence between the binary representation of $u_{(1,2)}$ and its representation over $GF(4)$ is as follows: $[u_2, u_1] = [0, 0] \equiv 0$; $[0, 1] \equiv 1$; $[1, 0] \equiv \alpha$ and $[1, 1] \equiv \alpha^2$.

$$
\mu_{\&_{(1,2)} \rightarrow e_1} = \ln \left( \frac{\exp \left\{ -\mu_{e_2 \rightarrow \&_{(1,2)}} - \mu_{u_{(1,2)}}^{(\text{in})(\alpha)} \right\} + 1}{\exp \left\{ -\mu_{e_2 \rightarrow \&_{(1,2)}} - \mu_{u_{(1,2)}}^{(\text{in})(\alpha^2)} \right\} + \exp \left\{ -\mu_{u_{(1,2)}}^{(\text{in})(1)} \right\}} \right) \tag{66}
$$

$$
\mu_{\&_{(1,2)} \rightarrow e_2} = \ln \left( \frac{\exp \left\{ -\mu_{e_1 \rightarrow \&_{(1,2)}} - \mu_{u_{(1,2)}}^{(\text{in})(1)} \right\} + 1}{\exp \left\{ -\mu_{e_1 \rightarrow \&_{(1,2)}} - \mu_{u_{(1,2)}}^{(\text{in})(\alpha^2)} \right\} + \exp \left\{ -\mu_{u_{(1,2)}}^{(\text{in})(\alpha)} \right\}} \right) \tag{67}
$$

$$
\mu_{u_{(1,2)}}^{(\text{out})(\alpha)} = \mu_{e_2 \rightarrow \&_{(1,2)}} \tag{68}
$$

$$
\mu_{u_{(1,2)}}^{(\text{out})(1)} = \mu_{e_1 \rightarrow \&_{(1,2)}} \tag{69}
$$

$$
\mu_{u_{(1,2)}}^{(\text{out})(\alpha^2)} = \mu_{e_1 \rightarrow \&_{(1,2)}} + \mu_{e_2 \rightarrow \&_{(1,2)}} \tag{70}
$$

We use the following aliases between the messages mentioned in (66)–(70) and the messages of the standard homogeneous kernel defined in (27)–(30): $\mu_{u_i}^{(\text{in})} \equiv \mu_{\&_{(1,2)} \rightarrow e_i}$; $\mu_{u_i}^{(\text{out})} \equiv \mu_{e_i \rightarrow \&_{(1,2)}}$, for $i \in \{1, 2\}$. The BP schedule suggested in Algorithm 17 is preserved, i.e. each iteration starts with an initialization step and then moves to BP decoding of its outer-codes. Messages (68)–(70) are computed before calling the BP decoder of the RS4 outer-code, in order to convert binary LLRs into quaternary ones. Moreover, messages (66) and (67) are employed after the BP iteration on the RS4 outer-code has finished, in order to convert the quaternary LLRs into binary ones.
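
The conversions performed by the '&' node can be sketched in Python as follows. The sketch assumes the LLR convention $\mu^{(t)} = \ln(\Pr(y|\omega=0)/\Pr(y|\omega=t))$ and the bit-to-symbol correspondence of Example 4, from which the assignment of the letters $1$, $\alpha$, $\alpha^2$ to the individual terms follows; the function names and the dictionary keys `"1"`, `"a"`, `"a2"` are hypothetical.

```python
import math


def to_quaternary(mu_e1, mu_e2):
    """Binary LLRs of (u_1, u_2) -> quaternary LLRs of u_(1,2)."""
    return {"1": mu_e1, "a": mu_e2, "a2": mu_e1 + mu_e2}


def to_e1(mu_e2, mu_in):
    """Quaternary LLRs mu_in plus the binary LLR of u_2 -> LLR sent to e_1."""
    num = 1.0 + math.exp(-mu_e2 - mu_in["a"])
    den = math.exp(-mu_in["1"]) + math.exp(-mu_e2 - mu_in["a2"])
    return math.log(num / den)


def to_e2(mu_e1, mu_in):
    """Quaternary LLRs mu_in plus the binary LLR of u_1 -> LLR sent to e_2."""
    num = 1.0 + math.exp(-mu_e1 - mu_in["1"])
    den = math.exp(-mu_in["a"]) + math.exp(-mu_e1 - mu_in["a2"])
    return math.log(num / den)


# Sanity check: if the quaternary input factorizes into independent bit LLRs
# p1 and p2, the node returns exactly p1 towards e_1, whatever e_2 reports.
p1, p2 = 0.8, -1.3
mu_in = to_quaternary(p1, p2)
back_to_e1 = to_e1(0.4, mu_in)
back_to_e2 = to_e2(-2.0, mu_in)
```

The factorized input is a convenient test case because both conversions can then be verified by hand.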

In the next section we describe architectures implementing the decoding algorithms we covered so far.

5 Recursive Descriptions of Polar Code Decoders Hardware Architectures

In this section we study schematic architectures that are induced from the recursive decoding algorithms presented in Section 3. Most of the algorithmic details were given in that section; therefore, the purpose of our discussion here is to consider aspects of hardware implementation, such as possible parallelism, scheduling and memory resource management. Note, however, that throughout the discussion our presentation is relatively abstract, emphasizing the important concepts and features of the recursive designs without delving into all the specifics. Consequently, the figures representing the block diagrams should not be considered as fully detailed specifications of the implementation, but rather as illustrations that aim to guide the reader in the task of designing the decoder.

Throughout this section we use the same notation for signal arrays and register arrays. Let \( u(0 : N - 1) \) be an \( N \) length signals array. We denote its \( i^{th} \) component by \( u(i) \). If \( v \) is a two dimensional array (i.e. a matrix) of \( M \) rows and \( N \) columns, we denote it by \( v(0 : M - 1, 0 : N - 1) \). Naturally, the \( i^{th} \) row of this array is denoted by \( v(i, 0 : N - 1) \), and it is a one dimensional array of \( N \) elements, of which the \( j^{th} \) element is denoted by \( v(i, j) \).

5.1 Arikan’s Construction Decoders

This subsection covers architectures for Arikan’s \((u + v, v)\) construction. Generalizations of this discussion for other polar code types are presented in Subsection 5.2. We begin with the simple SC pipeline decoder (Subsection 5.1.2), and then proceed to the more efficient SC line decoder (Subsection 5.1.3). Both of these designs were previously presented by Leroux et al. \([14, 15]\) in a non-recursive fashion. We conclude by introducing a BP line decoder (Subsection 5.1.6).

5.1.1 The Processing Element

The basic computation element of the decoding circuits, described in Subsections 5.1.2 and 5.1.3, is the processing element (PE). Figure 13 depicts the PE block. Note that throughout Subsection 5.1 we use thick arrows to designate signals corresponding to real numbers (to be represented by some quantization method) and thin arrows to designate binary signals. The PE block has three inputs:

- \( \lambda(0 : 1) \) - an array of two input LLRs.
- \( \hat{u}^{(in)} \) - an estimation of the "\( u \)" bit from the coded pair \((u + v, v)\).
- \( c_u \) - a binary control signal determining the type of LLR that the circuit gives as output in \( \lambda^{(out)} \):
  - \( c_u = 0 \) means that we calculate the LLR of \( u \) and \( c_u = 1 \) means that we calculate the LLR of \( v \) given the estimation of \( u \) (the input signal \( \hat{u}^{(in)} \)).

The circuit outputs the LLR of \( u \) or \( v \) depending on the control signal \( c_u \)

\[
\lambda^{(out)} = \begin{cases} 
2 \tanh^{-1} \left( \tanh \left( \lambda(0)/2 \right) \tanh \left( \lambda(1)/2 \right) \right), & c_u = 0; \\
(-1)^{\hat{u}^{(in)}} \cdot \lambda(0) + \lambda(1), & c_u = 1.
\end{cases}
\] (71)
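
As a behavioural illustration only, the PE can be modelled in floating point by the short Python sketch below. The function name `processing_element` is hypothetical, and a hardware PE would operate on quantized LLRs (typically with an approximation of the $\tanh$ rule), which this sketch does not attempt to model.

```python
import math


def processing_element(lam, u_in, c_u):
    """Floating-point model of the PE output.

    lam  -- pair (lambda(0), lambda(1)) of input LLRs
    u_in -- estimate of the "u" bit (0 or 1); only used when c_u == 1
    c_u  -- 0: output the LLR of u, 1: output the LLR of v given u_in
    """
    lam0, lam1 = lam
    if c_u == 0:
        return 2.0 * math.atanh(math.tanh(lam0 / 2.0) * math.tanh(lam1 / 2.0))
    return (-1) ** u_in * lam0 + lam1


llr_u = processing_element((1.5, -0.5), 0, 0)   # LLR of u
llr_v = processing_element((1.5, -0.5), 1, 1)   # LLR of v given u_hat = 1
```

Note that `math.atanh` is undefined at $\pm 1$, so very large input LLRs would need clipping in a practical model.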
5.1.2 The SC Pipeline Decoder

Figure 15 contains a block description of the SC pipeline decoder. The decoder’s signals $\lambda(0:N-1)$, $z(0:N-1)$, $\hat{u}(0:N-1)$ and $\hat{x}(0:N-1)$ correspond to the inputs and outputs of the SCDecoder function (17) $\lambda$, $z$, $\hat{u}$ and $\hat{x}$, respectively. For code length $N = 2$ bits, the SC decoder includes a single PE and a slicer. It operates according to Algorithm 6.

A block diagram of the implementation of this decoder for $N > 2$ is depicted in Figure 15. Scanning the diagram from right to left we can observe the following ingredients. The $\lambda(0:N-1)$ LLR input to the circuit is given as input to an array of $N/2$ PEs, $\{PE_j\}_{j=0}^{N/2-1}$, all of which are controlled by the same control signal, $c_u^{\text{internal}}$. The output of these PEs is denoted by the array of signals $\Lambda(0:N/2-1)$ and stored in an array of $N/2$ registers $R(0:N/2-1)$ (depicted as rectangular blocks with the register names, $R(i)$, written in them). These registers are given as the LLR input to an SC pipeline decoder of length $N/2$ bits. This decoder is referred to as the embedded $N/2$ length decoder within the $N$ length decoder.

The embedded decoder is also given as input the frozen bits indicator signals $\tilde{z}(0:N/2-1)$ (binary array), which is generated by splitting the $z(0:N-1)$ binary array into two halves using the MUX array (M0a). The multiplexers in (M0a) are controlled by the internal binary signal outerCodeID that indicates the ordinal of the outer-code that the embedded decoder decodes. For instance, if outerCodeID = 0 then the embedded decoder handles the first outer-code and therefore it should be given as input the first half of the $z$ array. The two outputs of the embedded decoder are denoted by signals arrays $\tilde{u}(0:N/2-1)$ and $\tilde{x}(0:N/2-1)$. The array $\tilde{u}(0:N/2-1)$ is given as input to the two halves of the output decoded information bits array $\hat{u}(0:N-1)$. The DeMUX array (M0b) determines to which part of the $\hat{u}$ array $\tilde{u}$ is written.

The Encoding Unit performs the encoding of the outer-codes' estimated codewords into the estimated codeword of the $N$ length code. The binary register $\text{tmp}\hat{x}(0:N-1)$ stores the temporary value of the estimated codeword $\hat{x}$, which is the signals array at its input. The encoding layer is given as input the outerCodeID signal and the two signals arrays $\tilde{x}(0:N/2-1)$ and $\text{tmp}\hat{x}(0:N-1)$. Its output is derived as follows

\[
\hat{x}(2j : (2j + 1)) = \begin{cases} 
[\tilde{x}(j), 0], & \text{outerCodeID} = 0; \\
[\tilde{x}(j) + \text{tmp}\hat{x}(2j), \tilde{x}(j)], & \text{outerCodeID} = 1 
\end{cases}, \quad \forall j \in \left[\frac{N}{2}\right]_-.
\] (72)

Figure 15: Block diagram for the SC pipeline decoder

Note that in order to avoid delays due to sampling by registers, it is important that the codeword estimation (which is one of the outputs of the decoder) will be the output of the encoding layer and not the register following it. This issue and further timing concerns are considered in the next subsection.
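
The behaviour of the Encoding Unit over the two outer-code passes can be sketched in Python as below, assuming the rule (72) reads $[\tilde{x}(j), 0]$ for the first outer-code and $[\tilde{x}(j) + \text{tmp}\hat{x}(2j), \tilde{x}(j)]$ for the second one, with addition over $GF(2)$. The function and argument names are illustrative.

```python
def encoding_unit(x_tilde, tmp_x, outer_code_id):
    """Combine an outer-code codeword estimate into the length-N estimate.

    x_tilde       -- N/2 bits produced by the embedded decoder
    tmp_x         -- N bits held in the temporary codeword register
    outer_code_id -- 0 for the first outer-code, 1 for the second
    """
    half = len(x_tilde)
    x_hat = [0] * (2 * half)
    for j in range(half):
        if outer_code_id == 0:
            x_hat[2 * j], x_hat[2 * j + 1] = x_tilde[j], 0
        else:
            x_hat[2 * j] = x_tilde[j] ^ tmp_x[2 * j]   # u_j + v_j
            x_hat[2 * j + 1] = x_tilde[j]              # v_j
    return x_hat


u = [1, 0]                               # first outer-code estimate
v = [1, 1]                               # second outer-code estimate
after_first = encoding_unit(u, [0, 0, 0, 0], 0)
after_second = encoding_unit(v, after_first, 1)
```

After the second pass the output holds the pairs $(u_j + v_j, v_j)$, i.e. the $(u+v, v)$ codeword estimate.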

We describe the recursive schematic decoding procedure for \( N > 2 \) in Algorithm 19. Let us consider the complexity of this circuit. We assume that a PE finishes its operation in one clock cycle. Denote by \( T(n) \) the time (in terms of the number of clock cycles) that is required to complete the decoding of an \( N = 2^n \) length polar code. Then, \( T(n) = 2 + 2 \cdot T(n - 1) \) for \( n > 1 \) and \( T(1) = 2 \). This recursion yields \( T(n) = 2N - 2 \). Denote by \( P(n) \) the number of PEs for a decoder of a length \( N = 2^n \) bits polar code. We have \( P(n) = 2^{n-1} + P(n - 1) \) for \( n > 1 \) and \( P(1) = 1 \), resulting in \( P(n) = 2^n - 1 = N - 1 \). The cost of the encoding unit is \( 2 \cdot \sum_{i=1}^{n} 2^i = 4 \cdot (N - 1) \) bit registers and \( \sum_{i=0}^{n-1} 2^i = N - 1 \) xor circuits. Let \( \rho(n) \) be the number of registers holding LLR values; then \( \rho(n) = 2^{n-1} + \rho(n - 1) \) for \( n > 1 \) and \( \rho(1) = 0 \), so \( \rho(n) = N - 2 \). Note that in this design we assume that the encoding layer is a combinational circuit.
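
The closed forms above can be checked numerically by unrolling the recursions, as in the following Python sketch (the function name `pipeline_costs` is an illustrative choice).

```python
def pipeline_costs(n):
    """Unroll the recursions for T(n), P(n) and rho(n) of the pipeline decoder.

    Returns (clock cycles, number of PEs, number of LLR registers)
    for a code of length N = 2**n.
    """
    cycles, pes, llr_regs = 2, 1, 0          # values for n = 1
    for k in range(2, n + 1):
        cycles = 2 + 2 * cycles              # T(k) = 2 + 2 T(k-1)
        pes = 2 ** (k - 1) + pes             # P(k) = 2^(k-1) + P(k-1)
        llr_regs = 2 ** (k - 1) + llr_regs   # rho(k) = 2^(k-1) + rho(k-1)
    return cycles, pes, llr_regs


for n in range(1, 11):
    N = 2 ** n
    assert pipeline_costs(n) == (2 * N - 2, N - 1, N - 2)
```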

5.1.3 The SC Line Decoder

In the decoder pipeline design of length \( N \) polar code, the \( N/2 \) processing elements \( \{PE_j\}_{j=0}^{N/2-1} \), are only employed during steps 0 and 2 of the algorithm. During the other steps (that ideally consume \( 2 \cdot T(n - 1) = 2N - 4 \) clock cycles of the total \( 2N - 2 \) clock cycles) these processors are idle, resulting in an inefficient design. In order to increase the processors utilization we observe that the maximum number of operations that can be done in parallel by the PEs in the SC decoding algorithm is \( N/2 \). As
Algorithm 19 SC Pipeline Decoder of Length $N$ $(u + v, v)$ Polar Code

//STEP 0:
▷ Set $c_u^{\text{(internal)}} = outerCodeID = 0$.
▷ Using the PEs array $\{PE_j\}_{j=0}^{N/2-1}$, prepare the LLRs input array for the embedded decoder of the first $N/2$ length outer-code and output it on the signals array $\Lambda(0 : N/2 - 1)$, such that

$$\Lambda(j) = 2 \tanh^{-1} \left( \tanh(\lambda(2j)/2) \tanh(\lambda(2j + 1)/2) \right), \quad j \in [N/2]_-.$$

▷ Sample the $\Lambda(0 : N/2 - 1)$ array by the registers array $R(0 : N/2 - 1)$. Sample the first half of the frozen bits indicator $z$ by the $\tilde{z}$ register, i.e. $\tilde{z}(0 : N/2 - 1) = z(0 : N/2 - 1)$.

//STEP 1:
▷ Execute the embedded decoder on $R(0 : N/2 - 1)$ and $\tilde{z}(0 : N/2 - 1)$.
▷ Sample the $\tilde{u}(0 : N/2 - 1)$ output array by the first half of $\hat{u}$, i.e. $\hat{u}(0 : N/2 - 1) = \tilde{u}(0 : N/2 - 1)$.
▷ Sample the $\tilde{x}(0 : N/2 - 1)$ output array by the $x^{(\text{outer})}(0 : N/2 - 1)$ register, i.e. $x^{(\text{outer})}(0 : N/2 - 1) = \tilde{x}(0 : N/2 - 1)$.
▷ Let the Encoding Unit process $\tilde{x}(0 : N/2 - 1)$ according to (72).

//STEP 2:
▷ Set $c_u^{\text{(internal)}} = outerCodeID = 1$.
▷ Using the PEs array $\{PE_j\}_{j=0}^{N/2-1}$, prepare the LLRs input array for the embedded decoder of the second $N/2$ length outer-code and output it on the signals array $\Lambda(0 : N/2 - 1)$, such that

$$\Lambda(j) = (-1)^{x^{(\text{outer})}(j)} \lambda(2j) + \lambda(2j + 1), \quad j \in [N/2]_-.$$

▷ Sample the $\Lambda(0 : N/2 - 1)$ array by the registers array $R(0 : N/2 - 1)$. Sample the second half of the frozen bits indicator $z$ by the $\tilde{z}$ register, i.e. $\tilde{z}(0 : N/2 - 1) = z(N/2 : N - 1)$.

//STEP 3:
▷ Execute the embedded decoder on $R(0 : N/2 - 1)$ and $\tilde{z}(0 : N/2 - 1)$.
▷ Sample the $\tilde{u}(0 : N/2 - 1)$ output array by the second half of $\hat{u}$, i.e. $\hat{u}(N/2 : N - 1) = \tilde{u}(0 : N/2 - 1)$.
▷ Let the Encoding Unit process $\tilde{x}(0 : N/2 - 1)$ according to (72).

a consequence, in order to support the maximum level of parallelism, the design has to include at least $N/2$ PEs. The line decoder\footnote{Note that, strictly speaking, the original line decoder presented by Leroux et al. [15, Section 3.3] is not precisely the same design as the one discussed here. The differences, however, appear to be minor (existing mostly in the routing between the LLR registers and the PEs). As a consequence we preferred not to distinguish it from Leroux’s design.} that we describe in this subsection achieves this lower bound.

Figure 14 depicts the line decoder block for length $N$ bits code. The line decoder has two operation modes.

**Standard Mode (S-Mode):** $modeIn = 0$

The decoder gets as input LLRs array, $\lambda(0 : N-1)$, and the frozen bits indicator vector, $z(0 : N-1)$. Upon completion of its operation the decoder outputs the hard decision on the information word $\hat{u}(0 : N-1)$ and its corresponding codeword $\hat{x}(0 : N-1)$ (this is the operation mode we supported thus far in the pipeline decoder).

**PE-Array Mode (P-Mode):** $modeIn = 1$

The decoder gets as input a signals array of LLRs $\lambda(0 : N-1)$, a control signal $c_u^{(in)}$ and a binary vector $\hat{u}^{(in)}(0 : N/2 - 1)$. The output is a signals array $\lambda^{(out)}(0 : N/2 - 1)$ of LLRs, where

$$
\lambda^{(out)}(j) = \begin{cases} 
2 \cdot \tanh^{-1} \left( \tanh \left( \lambda(2j)/2 \right) \cdot \tanh \left( \lambda(2j + 1)/2 \right) \right), & c_u^{(in)} = 0; \\
(-1)^{\hat{u}^{(in)}(j)} \cdot \lambda(2j) + \lambda(2j + 1), & c_u^{(in)} = 1,
\end{cases} \quad \forall j \in \left[ \frac{N}{2} \right]_-.
$$

(73)

In Figure 16 we provide a block diagram for this decoder. Note that in order to maintain the maximum level of parallelism, the length $N$ polar code decoder ought to have $N/2$ processors. Thus, in order to build the length $N$ polar code decoder using an embedded $N/2$ length polar code decoder (already having $N/4$ processors), we use an additional array of $N/4$ PEs, which is referred to as the auxiliary array. The input signal $modeIn$ indicates whether the decoder is used in S-Mode or in P-Mode. The $mode$ signal is an internal signal that controls whether the $N/2$ length embedded decoder is in P-Mode.

Let us scan Figure 16 from right to left and observe its important ingredients. The auxiliary PEs array contains $N/4$ processors $\{PE_j\}_{j=0}^{N/4-1}$ to which the second half of the input array, $\lambda(N/2 : N - 1)$, is connected. The first half of the input LLRs array, $\lambda(0 : N/2 - 1)$, is connected to the embedded line decoder via the MUX array (M2), in which all the multiplexers are controlled by the binary signal $c_m$. The other input alternative of the (M2) array is the registers array $R(0 : N/2 - 1)$. The $c_u$ input of $\{PE_j\}_{j=0}^{N/4-1}$ is determined by the (M3) multiplexer, such that in S-Mode ($modeIn = 0$) the input is $c_u^{(internal)}$ (an internal signal) and otherwise $c_u = c_u^{(in)}$ (one of the inputs to the length $N$ decoder). The output of (M3) also serves as the $c_u^{(in)}$ input to the embedded decoder. The $modeIn$ signal further controls the (M4) MUX array, such that in P-Mode the $\hat{u}^{(in)}$ input to $\{PE_j\}_{j=0}^{N/4-1}$ is the input sub-vector $\hat{u}^{(in)}(N/4 : N/2 - 1)$ and otherwise the input is $x^{(outer)}(N/4 : N/2 - 1)$ (the second half of the estimated codeword output of the embedded decoder). Furthermore, $modeIn$ also controls the (M1) MUX array that selects between $\hat{u}^{(in)}(0 : N/4 - 1)$ and $x^{(outer)}(0 : N/4 - 1)$ for $modeIn = 1$ and $modeIn = 0$ respectively. The output of (M1) serves as the $\hat{u}^{(in)}$ input of the embedded decoder. The internal binary signal, $mode$, is given to the embedded decoder as its $modeIn$ input.

The S-Mode and the P-Mode procedures of the line decoder are described in Algorithms 20 and 21 respectively. Let us discuss the complexity of the decoder. Let $P(n)$ be the number of processors of the $N = 2^n$ decoder. Then, $P(n) = 2^{n-2} + P(n-1)$ for $n > 1$ and $P(1) = 1$, so $P(n) = 2^{n-1} = N/2$. The number of LLR registers is $\rho(n) = 2^{n-1} + \rho(n-1)$ for $n > 1$ and $\rho(1) = 1$, so we have $\rho(n) = 2^n - 1 = N - 1$. Note that $\rho(n)$ does not account for the binary registers for $\tilde{z}$, $\text{tmp}\hat{x}$ and $\hat{u}$.
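
The recursive structure of P-Mode, and the resulting processor count, can be illustrated by the following Python sketch. It is a behavioural model under stated assumptions: the embedded decoder serves the outputs $j < N/4$ from $\lambda(0 : N/2 - 1)$, the auxiliary array serves $N/4 \leq j < N/2$ from $\lambda(N/2 : N - 1)$, and the length 2 decoder consists of a single PE. The names `pe` and `p_mode` are hypothetical.

```python
import math


def pe(lam0, lam1, u_in, c_u):
    """Floating-point model of a single processing element, as in (71)."""
    if c_u == 0:
        return 2.0 * math.atanh(math.tanh(lam0 / 2.0) * math.tanh(lam1 / 2.0))
    return (-1) ** u_in * lam0 + lam1


def p_mode(lam, u_in, c_u):
    """Recursive P-Mode: returns (output LLRs, number of PEs instantiated)."""
    n = len(lam)
    if n == 2:                                   # length-2 decoder: one PE
        return [pe(lam[0], lam[1], u_in[0], c_u)], 1
    half, quarter = n // 2, n // 4
    # Embedded length-N/2 decoder handles the first half of the inputs.
    emb_out, emb_pes = p_mode(lam[:half], u_in[:quarter], c_u)
    # Auxiliary array of N/4 PEs handles the second half of the inputs.
    aux_out = [
        pe(lam[half + 2 * k], lam[half + 2 * k + 1], u_in[quarter + k], c_u)
        for k in range(quarter)
    ]
    return emb_out + aux_out, emb_pes + quarter


lam = [0.9, -1.1, 2.0, 0.3, -0.4, 1.7, 0.6, -2.2]
u_in = [1, 0, 0, 1]
out, num_pes = p_mode(lam, u_in, 1)
```

For $N = 8$ the sketch instantiates four PEs, in agreement with $P(n) = N/2$, and its output coincides with a direct evaluation of (73).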

At this point, we would like to make a remark regarding the efficiency of the proposed design. The recursive design has the benefit of being a comprehensible reflection of the implemented algorithm. It also has the advantage of emphasizing the parts of the system that may be reused. However, it might be argued that it has a disadvantage considering the routing of signals in the circuit. This is because we use
Figure 16: Block diagram for the SC line decoder
Algorithm 20 S-Mode of SC Line-Decoder of Length $N$ \((u + v, v)\) Polar Code \((\text{modeIn} = 0)\)

//STEP 0:
\(\triangleright\) Set \(c_m = c_u^{(\text{internal})} = \text{outerCodeID} = 0, \text{mode} = 1.\)

Operate the embedded decoder in P-Mode, such that at the output of the decoder we have
\[
\Lambda(j) = 2 \cdot \tanh^{-1}(\tanh(\lambda(2j)/2) \cdot \tanh(\lambda(2j + 1)/2)) \quad \forall j \in [N/4]_-.
\]

Use the auxiliary PEs array and compute
\[
\Lambda(j) = 2 \cdot \tanh^{-1}(\tanh(\lambda(2j)/2) \cdot \tanh(\lambda(2j + 1)/2)) \quad \forall N/4 \leq j \leq N/2 - 1
\]

Sample the \(\Lambda(0 : N/2 - 1)\) array by the registers array \(R(0 : N/2 - 1)\). Sample the first half of the frozen bits indicator \(z\) by the \(\tilde{z}\) register, i.e. \(\tilde{z}(0 : N/2 - 1) = z(0 : N/2 - 1)\).

//STEP 1:
\(\triangleright\) Set \(\text{mode} = 0\) and \(c_m = 1.\)

Operate the embedded decoder in S-Mode on \(R(0 : N/2 - 1)\) and \(\tilde{z}(0 : N/2 - 1)\).
\(\triangleright\) Sample the \(\tilde{u}(0 : N/2 - 1)\) output array by the first half of \(\hat{u}\), i.e. \(\hat{u}(0 : N/2 - 1) = \tilde{u}(0 : N/2 - 1)\).

Sample the \(\tilde{x}(0 : N/2 - 1)\) output array by the \(x^{(\text{outer})}(0 : N/2 - 1)\) register, i.e. \(x^{(\text{outer})}(0 : N/2 - 1) = \tilde{x}(0 : N/2 - 1)\). Let the Encoding Unit process \(\tilde{x}(0 : N/2 - 1)\) according to (72).

//STEP 2:
\(\triangleright\) Set \(c_m = 0, c_u^{(\text{internal})} = \text{mode} = \text{outerCodeID} = 1.\)

Operate the embedded decoder in P-Mode, such that at the output of the decoder we have
\[
\Lambda(j) = (-1)^{x^{(\text{outer})}(j)} \cdot \lambda(2j) + \lambda(2j + 1) \quad \forall j \in [N/4]_-.
\]

Use the auxiliary PEs array and compute
\[
\Lambda(j) = (-1)^{x^{(\text{outer})}(j)} \cdot \lambda(2j) + \lambda(2j + 1) \quad \forall N/4 \leq j \leq N/2 - 1
\]

Sample the \(\Lambda(0 : N/2 - 1)\) array by the registers array \(R(0 : N/2 - 1)\). Sample the second half of the frozen bits indicator \(z\) by the \(\tilde{z}\) register, i.e. \(\tilde{z}(0 : N/2 - 1) = z(N/2 : N - 1)\).

//STEP 3:
\(\triangleright\) Set \(\text{mode} = 0\) and \(c_m = 1.\)

Operate the embedded decoder in S-Mode on \(R(0 : N/2 - 1)\) and \(\tilde{z}(0 : N/2 - 1)\).
\(\triangleright\) Sample the \(\tilde{u}(0 : N/2 - 1)\) output array by the second half of \(\hat{u}\), i.e. \(\hat{u}(N/2 : N - 1) = \tilde{u}(0 : N/2 - 1)\). Let the Encoding Unit process \(\tilde{x}(0 : N/2 - 1)\) according to (72).
\begin{algorithm}
\caption{P-Mode of SC Line-Decoder of Length $N$ $(u + v, v)$ Polar Code ($\text{modeIn} = 1$)}

$\triangleright$ Set $c_m = 0$, $\text{mode} = 1$.

Operate the embedded decoder in P-Mode, such that at the output of the decoder we have

\[\Lambda(j) = \begin{cases} 
2 \cdot \tanh^{-1} (\tanh (\lambda(2j)/2) \cdot \tanh (\lambda(2j + 1)/2)), & \text{if } c_u^{(in)} = 0; \\
(-1)^{\hat{u}^{(in)}(j)} \cdot \lambda(2j) + \lambda(2j + 1), & \text{if } c_u^{(in)} = 1,
\end{cases} \quad \forall j \in [N/4]_{-}.\]

Use the auxiliary PEs array and compute

\[\Lambda(j) = \begin{cases} 
2 \cdot \tanh^{-1} (\tanh (\lambda(2j)/2) \cdot \tanh (\lambda(2j + 1)/2)), & \text{if } c_u^{(in)} = 0; \\
(-1)^{\hat{u}^{(in)}(j)} \cdot \lambda(2j) + \lambda(2j + 1), & \text{if } c_u^{(in)} = 1,
\end{cases} \quad \forall N/4 \leq j \leq N/2 - 1.\]

//Note that the signals array $\Lambda(0 : N/2 - 1)$ is wired to the output signals array $\lambda^{(out)}(0 : N/2 - 1)$.

\end{algorithm}

the embedded decoder as a black box and consequently we route all the signals from it and to it, using its interface. As a result, some of the signals traverse lengthy paths before reaching their target processor. These paths may be too long for the decoder circuit to have an adequate clock frequency, thereby degrading the achievable throughput. We therefore recommend that, after constructing the circuit in a recursive manner, it be optimized by unfolding the recursive units and contracting the paths. Furthermore, we advise that for building a decoder for a code of length $2N$ bits, the designer use the already optimized design of the length $N$ decoder (for the embedded unit), thereby taking advantage of the recursion.

We give below two examples of long-path hazards that are likely to pose a problem. Workarounds for these challenges are also provided.

1. The (M2) MUX array at the input of the embedded line decoder of the length $N/2$ code was included because of the introduction of P-Mode. A closer examination of our design reveals that some of the (M2) input signals traverse long paths before reaching their destination PE. For example, the inputs $\lambda(0)$ and $\lambda(1)$ need to traverse $\log_2(N) - 1$ multiplexer layers before reaching their processor. Since P-Mode needs to be accomplished in a single clock cycle, this long path might be prohibitive. By unfolding the $N/2$ length embedded decoder block, the designer is able to control the path lengths by carefully routing the signals.

2. The encoding layer also suffers from long routing. In our analysis, we assumed that the encoding procedure is combinatorial, and therefore has to be completed within one clock cycle. This may be a problem when several encoding circuits are operated one after the other. For instance, this is the case for step 3 of the decoder of the length $N/2^i$ code, which occurs within step 3 of the decoder of the length $N/2^{i-1}$ code, for all $i \in [2N - 2]$. In this case, $O(\log N)$ operations need to occur sequentially in one clock cycle. For large $N$ and a high clock frequency, this might not be feasible. The idea of Leroux et al. [15] was to use flip-flops for saving the partial encoding for each code bit in the different layers of the decoding circuit. Each such flip-flop is connected through an XOR circuit to the signal line of the estimated information bit. As such, whenever the SC decoder decides on an information bit, the flip-flops corresponding to the code bits that depend on this information bit are updated accordingly. These flip-flops need to be reset whenever we start decoding their corresponding outer-code. For example, when we start using the embedded $N/2$ length decoder (on step 1 and step 3), its partial-encoding flip-flops need to be erased (because they correspond to a new instance of the outer-code).

The above notion may also be described recursively, by changing the specification of the length \(N\) polar code decoder in S-Mode, and requiring it to output the estimated information bits as soon as they are ready. The decoder should also have an \(N\) length binary indicator vector that indicates which code bits depend on the currently estimated information bit. It is easy to see that, using the indicator vector of the length \(N/2\) decoder, it is possible to calculate the \(N\) length indicator vector by using the \((u + v, v)\) mapping. This, however, again generates a computation path of length \(\Theta(\log N)\). This problem can be addressed by having a fixed indicator circuit for each partially encoded-bit flip-flop. This circuit indicates which information bits should be accumulated, depending on the ordinal number of the bit. For example, for the decoder of the code of length \(N\), we should have an array of \(N/2\) flip-flops, each one corresponding to a bit of the codeword of the \(N/2\) length first outer-code. Each one of these flip-flops should have an indicator circuit that gets as input the value of a counter signaling the ordinal number of the information bit that has been estimated, and returns 1 iff its corresponding codeword bit is influenced by this information bit. For example, the indicator circuit corresponding to the first code bit is a constant 1, because \( x_0 = \sum_{i=0}^{N/2-1} u_i \), i.e. it depends on all the information bits. On the other hand, the last bit's indicator (i.e. of \( x_{N/2-1} \)) returns 1 iff its input equals \(N/2 - 1\), because \( x_{N/2-1} = u_{N/2-1} \). Using the global counter (which is advanced whenever an information bit is estimated) and the indicator circuits, each code bit that is influenced by this information bit changes its flip-flop state accordingly.

Using the Kronecker power form of the generating matrix of the (u + v, v) polar code, it can be seen that each such indicator circuit can be designed by using no more than \( O(\log n) = O(\log \log N) \) AND and NOT circuits; therefore the total cost of these circuits is \( O(N \log \log N) \) in terms of space complexity. Further improvements to the efficiency of the circuit can be achieved by employing Fan and Tsui's high performance partial sum network [34]. This network implements the indicator circuits with constant space complexity and delay (per circuit).
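The flip-flop update driven by a global counter and per-bit indicator circuits can be sketched in software. The following Python sketch is illustrative only. It assumes that the codeword is \( x = u \cdot G_2^{\otimes n} \) in natural index order (no permutation of \( u \)), in which case code bit \( j \) depends on information bit \( i \) iff every 1 in the binary expansion of \( j \) is also set in \( i \). The names `indicator`, `encode_with_flip_flops` and `encode_recursive` are hypothetical and are not part of the design.

```python
def indicator(j, counter):
    """Return 1 iff code bit j depends on the information bit whose ordinal is `counter`.

    Assumption: x = u * G2^{(x)n} in natural order, so the entry (counter, j)
    of the Kronecker power is 1 iff the set bits of j are a subset of those of counter.
    """
    return 1 if (counter & j) == j else 0


def encode_with_flip_flops(u):
    """Accumulate the codeword bit by bit, as the partial-sum flip-flops would."""
    flops = [0] * len(u)                 # reset before a new outer-code instance
    for counter, bit in enumerate(u):    # counter advances with each estimated bit
        for j in range(len(u)):          # in hardware all flip-flops update in parallel
            if indicator(j, counter):
                flops[j] ^= bit
    return flops


def encode_recursive(u):
    """Reference encoder: x = (a + b, b), where a and b encode the two halves of u."""
    if len(u) == 1:
        return list(u)
    half = len(u) // 2
    a = encode_recursive(u[:half])
    b = encode_recursive(u[half:])
    return [p ^ q for p, q in zip(a, b)] + b


if __name__ == "__main__":
    u = [1, 0, 1, 1]
    print(encode_with_flip_flops(u), encode_recursive(u))
```

Under the stated assumption, the indicator of code bit 0 is constantly 1 and the indicator of the last code bit fires only for the last information bit, matching the two examples in the text. A hardware indicator circuit would evaluate the same predicate from the counter value with a few gates.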

In summary, the recursive architecture may be developed and modified to achieve the timing requirements of the circuit. This may be done by “opening the box” of the embedded decoders, and altering them to support more efficient designs.

A careful examination of the line-decoder reveals that the auxiliary PEs array is only used on steps 0 and 2, and is idle on the other steps. This fact motivates us to consider two variations on this design. The first one adds hardware and uses these arrays to increase the throughput, while the second one decreases the throughput and thereby reduces the required hardware.

5.1.4 Parallel Decoding of Multiple Codewords

High throughput communication systems may require support of simultaneous decoding of multiple codewords. A naive approach to meet this challenge is implementing p instances of the decoder when there is a need for decoding p codewords simultaneously. However, because the PEs auxiliary array is idle most of the time, it seems like a good idea to “share” this array among several decoders. By appropriately scheduling the commands to the processors, it is possible to have a decoder implementation for p parallel codewords which is less expensive than just duplicating the decoders.

Since the array is idle during steps 1 and 3, in which the embedded length \( N/2 \) decoder is active, it is possible to have \( p \leq T(n - 1) + 1 = N - 1 \) decoders sharing the same auxiliary array. The decoding of each one of them is issued with a delay of one clock cycle from the previous one. Assuming that \( p = N - 1 \), we have a decoding time of \( T(n) + N - 2 = 3N - 4 \) clock cycles for \( N - 1 \) codewords while having \( p \cdot P(n - 1) + N/4 = (N - 1) \cdot N/4 + N/4 = N^2/4 \) processors, which is about half of the number of processors of the naive solution, \( p \cdot P(n) = (N - 1) \cdot N/2 \).
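The arithmetic above can be checked numerically. The following Python sketch assumes \( T(n) = 2N - 2 \) clock cycles and \( P(n) = N/2 \) PEs for the standard line decoder, as used in the text; the function names are hypothetical.

```python
def T(n):
    """S-Mode decoding time (clock cycles) of the standard line decoder, N = 2**n."""
    return 2 * 2**n - 2


def P(n):
    """Number of PEs of the standard line decoder of length N = 2**n."""
    return 2**n // 2


def shared_aux_design(n):
    """Return (p, shared PE count, naive PE count, latency) for one shared auxiliary array."""
    N = 2**n
    p = T(n - 1) + 1                  # maximal number of decoders sharing the array
    shared = p * P(n - 1) + N // 4    # p embedded decoders plus one auxiliary array
    naive = p * P(n)                  # p independent line decoders
    latency = T(n) + p - 1            # decoders start one clock cycle apart
    return p, shared, naive, latency


if __name__ == "__main__":
    for n in range(2, 11):
        N = 2**n
        p, shared, naive, latency = shared_aux_design(n)
        assert p == N - 1
        assert shared == N * N // 4
        assert latency == 3 * N - 4
        print(N, p, shared, naive, latency)
```

For every block length tried, the shared design uses \( N^2/4 \) PEs against \( (N-1) \cdot N/2 \) for the naive duplication, which is the "about half" ratio stated above.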

This notion can be further developed. For the embedded \( N/2 \) length decoder, there is an auxiliary array of \( N/8 \) processors. This auxiliary array is used on steps 0 and 2 of the decoders of length \( N \) and length \( N/2 \). Therefore, it is idle most of the time, and we can share it among the \( p \) decoders of length \( N/2 \). Assuming that \( p = N - 1 \), we may allocate three auxiliary arrays that will be shared among the decoders, each one dedicated to one of these different steps: one array for step 0 (and 2) of the \( N \) length decoder, one array for step 0 of the \( N/2 \) length decoder and one array for step 2 of the \( N/2 \) length decoder. For each of the decoded codewords the number of clock cycles between these steps is at least \( p \); therefore there will be no contention on these resources and the throughput will not suffer because of this hardware reduction.

In general, for \( p = N - 1 \), the auxiliary array within the embedded decoder of the length \( \frac{N}{2^i} \) polar code \((i \in [\log_2(N) - 2])\) can be shared among the \( p \) decoders, provided that we allocate an instance of the array for each of the decoding steps it is used in during the first half of the decoding algorithm for the length \( N \) code (i.e. during the \( N \) length decoder's steps 0 and 1). As a consequence, for this specific array, we have one call in step 0 of the \( N \) length decoder, one call for step 0 and one call for step 2 of the embedded \( \frac{N}{2} \) length decoder, two calls for step 0 and two calls for step 2 of the \( \frac{N}{4} \) length embedded decoder, ..., \( 2^{i-1} \) calls for step 0 and \( 2^{i-1} \) calls for step 2 of the length \( \frac{N}{2^i} \) embedded decoder.

In summary, for each \( i \) with \( 0 \leq i \leq \log_2(N) - 2 \), we require \( 2^{i+1} - 1 = 2^0 + 2^1 + \ldots + 2^{i} \) auxiliary arrays of processors for the decoder of length \( \frac{N}{2^{i}} \), each one containing \( \frac{N}{2^{i+2}} \) PEs. In particular, we need \( N - 1 \) PEs for the length 2 decoder (each PE is allocated to a specific decoder), and \( \frac{N}{2} \cdot \sum_{i=0}^{\log_2(N) - 2} \frac{2^{i+1} - 1}{2^{i+1}} \approx \frac{N}{2} (\log_2(N) - 1) \) PEs for the other decoder lengths. This adds up to approximately \( \frac{N}{2} (1 + \log_2(N)) \) PEs. We conclude that this solution allows an increase of the throughput by a multiplicative factor of about \( N \), while the PE hardware is only increased by a factor of approximately \( \log_2(N) \).
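The PE budget of the fully shared design can be tallied with a short script. The following Python sketch rests on two assumptions read off the description above: the auxiliary array of the length \( N/2^i \) decoder holds \( N/2^{i+2} \) PEs (\( N/4 \) for \( i = 0 \), \( N/8 \) for \( i = 1 \)), and one instance of it is allocated per call during steps 0 and 1 of the length \( N \) decoder, i.e. \( 2^{i+1} - 1 \) instances. The function name is hypothetical.

```python
def shared_pe_count(n):
    """Total number of PEs when p = N - 1 codewords share the auxiliary arrays, N = 2**n."""
    N = 2**n
    total = N - 1                          # one PE per codeword for the length-2 decoders
    for i in range(n - 1):                 # i = 0, ..., log2(N) - 2
        instances = 2**(i + 1) - 1         # one instance per call (assumption, see above)
        pes_per_array = N // 2**(i + 2)    # size of the auxiliary array at level i
        total += instances * pes_per_array
    return total


if __name__ == "__main__":
    for n in range(2, 11):
        N = 2**n
        naive = (N - 1) * (N // 2)         # N - 1 independent line decoders
        approx = (N // 2) * (1 + n)        # approximation quoted in the text
        print(N, shared_pe_count(n), approx, naive)
```

Under these assumptions the exact total works out to \( \frac{N}{2} \log_2(N) \), which is within \( N/2 \) of the approximation \( \frac{N}{2}(1 + \log_2(N)) \) and far below the \( (N-1) \cdot N/2 \) PEs of the naive solution.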

Note, that the number of registers should be increased by a multiplicative factor of \( O(p) = O(N) \) as well.

A closer look at the above design reveals that we actually allocated a different array of processors for each sub-step of steps 0 and 1 of the \( N \) length decoder. The decoding operations of the \( p \) codewords go through these units in sequential order. However, each decoder should have its own set of registers saving the state of the decoding algorithm. Another observation is that when we finish decoding the first codeword (i.e. the one we started decoding at time 0), we can start decoding codeword number \( N \) in the next time slot (and then codeword number \( N + 1 \), etc.), in a pipelined fashion. Note that Leroux et al. considered a similar idea, and referred to it as the vector-overlapping architecture [14].

### 5.1.5 Limited Parallelism Decoding

An alternative approach for addressing the problem of low utilization of the auxiliary PEs arrays is to limit the number of processing elements that are allowed to operate simultaneously. This is a practical consideration, since a system design typically has a parallelism limitation due to power consumption and silicon area constraints. The limited parallelism inevitably results in an increase of the decoding time, and thereby a decrease of the throughput.

The length \( N \) line decoder has PE parallelism of \( N/2 \), because it may simultaneously compute at most \( N/2 \) LLRs using the \( N/2 \) PEs. Let us consider a line decoder of a length \( N \) code with limited parallelism of \( N/2^i \), where \( i \in [\log_2(N)] \). This means that the decoder has exactly \( \frac{N}{2^i} \) PEs. If \( i = 1 \) then the decoder is actually the standard line decoder. Figure 17 depicts the block diagram of the decoder for \( i > 1 \). We highlight the changes that were applied to the standard line decoder (Figure 18) in creating Figure 17:

- The auxiliary PEs array was omitted.
- The embedded line decoder of the \( N/2 \) length code was replaced by a limited parallelism line decoder, with parallelism of \( N/2^i \).
- The input to the registers array \( R(N/4 : N/2 - 1) \) is the signals array \( \Lambda(0 : N/4 - 1) \).
- A MUX array (M2a) was added providing the "channel" inputs to the (M2) MUX array. The control signal of the (M2a) array is an internal binary signal called subStep, such that the output of the array is \( \lambda(0 : N/2 - 1) \) if \( \text{subStep} = 0 \) and \( \lambda(N/2 : N - 1) \) otherwise. Similarly, subStep is the control signal of two additional MUX arrays (M1a) and (M1b) providing inputs to the (M1) MUX array. The outputs of these arrays equal \( \hat{u}^{(\text{in})} (0 : N/4 - 1) \) and \( x^{(\text{outer})} (0 : N/4 - 1) \) for \( \text{subStep} = 0 \), and \( \hat{u}^{(\text{in})} (N/4 : N/2 - 1) \) and \( x^{(\text{outer})} (N/4 : N/2 - 1) \) otherwise.
- The output LLR signals array \( \lambda^{(\text{out})} (0 : N/2 - 1) \) is routed such that
  \[
  \lambda^{(\text{out})} (0 : N/4 - 1) = R(0 : N/4 - 1) \text{ and } \lambda^{(\text{out})} (N/4 : N/2 - 1) = \Lambda(0 : N/4 - 1). \tag{74}
  \]

The limited parallelism S-Mode decoding algorithm has four steps as before; however, steps 0 and 2 are modified and now include two sub-steps each. On each sub-step we calculate half of the LLRs, because we do not have an auxiliary array. Note that, depending on the parallelism of the embedded decoder, those sub-steps may require more than one clock cycle. In a similar manner, the P-Mode operation is also amended, and now contains two sub-steps.

Let us analyze the time complexity of this algorithm. We denote by \( T(n,n-i) \) the S-Mode running time (in terms of clock cycles) for length \( N = 2^n \) bits polar code with limited parallelism of \( N/2^i = 2^{n-i} \). We note that \( T(n,n-1) = T(n) \), where \( T(n) = 2N - 2 \) is the time complexity of the standard line decoder. The following recursion formula is derived

\[
T(n,n-i) = 2 \cdot T(n-1,n-i) + 4 \cdot T_p(n-1,n-i),
\]

where \( T_p(n,m) \) is the running time of the \( N = 2^n \) bits length decoder with \( 2^m \) limited parallelism in P-Mode, which satisfies

\[
T_p(n,m) = \begin{cases} 
1, & n - m \leq 1; \\
2 \cdot T_p(n-1,m), & \text{otherwise}.
\end{cases}
\]

Therefore,

\[
T_p(n,m) = \begin{cases} 
1, & n - m \leq 1; \\
2^{n-m-1}, & \text{otherwise}.
\end{cases}
\]

It can be shown that

\[
T(n,n-i) = 2 \cdot N + (i-2) \cdot 2^i, \quad i \geq 1.
\]

Equation (75) reveals the tradeoff between the number of PEs and the running time of the algorithm. For example, decreasing the number of processors by a multiplicative factor of 8 compared to the standard case (i.e. \( i = 4 \)) results in an increase of only 34 clock cycles in the decoding time. We note, however, that in order to implement such a decoder, additional routing circuitry (e.g. multiplexer layers) should be included.
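The closed-form running time can be cross-checked against the recursion. The following Python sketch implements the two recursions for \( T(n, m) \) and \( T_p(n, m) \) exactly as written above, with the standard line decoder (\( T(n) = 2N - 2 \)) as the base case; the function names are hypothetical.

```python
def T_p(n, m):
    """P-Mode running time for a length 2**n decoder with 2**m PEs."""
    return 1 if n - m <= 1 else 2 * T_p(n - 1, m)


def T_s(n, m):
    """S-Mode running time for a length 2**n decoder with 2**m PEs (m <= n - 1)."""
    if n - m <= 1:
        return 2 * 2**n - 2            # standard line decoder, T(n) = 2N - 2
    # two S-Mode runs of the embedded decoder (steps 1 and 3) and
    # four P-Mode runs (two sub-steps in each of steps 0 and 2)
    return 2 * T_s(n - 1, m) + 4 * T_p(n - 1, m)


def closed_form(n, i):
    """Closed-form expression 2N + (i - 2) * 2**i for parallelism N / 2**i."""
    return 2 * 2**n + (i - 2) * 2**i


if __name__ == "__main__":
    for n in range(1, 13):
        for i in range(1, n + 1):
            assert T_s(n, n - i) == closed_form(n, i)
    n = 10
    print(T_s(n, n - 4) - T_s(n, n - 1))   # extra clock cycles for i = 4
```

The recursion and the closed form agree for all block lengths and parallelism levels tried, and the difference between \( i = 4 \) and the standard case \( i = 1 \) is independent of \( N \).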

**Remark 3 (SCL Implementation)** For a limited list size, the SCL decoder may also be implemented by a line decoder. This requires duplicating the hardware by the list size, \( L \), and introducing the appropriate logic (i.e. comparators and multiplexer layers). It is possible to provide an implementation with \( O(f(L) \cdot N) \) time complexity, where \( f(\cdot) \) is a polynomially bounded function that depends on the efficiency of the algorithms for selecting the \( L \) most likely decoding paths from a list of \( 2L \) paths (which is done by the \( N = 2^n \) length decoder). Furthermore, the normalization of the likelihoods should be carefully considered, as it also has an impact on the precise (i.e. non-asymptotic) time complexity. As was mentioned in Subsection 4.1.5, by limiting the parallelism of the decoder, it is possible to reduce the number of processors with a reasonable hit to the throughput.

### 5.1.6 The BP Line Decoder

As we already noticed in Subsection 4.3, BP is an iterative algorithm in which messages are sent on the normal factor graph representing the code. In this subsection, we consider an implementation of the BP decoder that employs the GCC serial schedule. Figure 18a depicts the processing element (PE) of the proposed design. This unit has two inputs for message LLRs \((\mu_0^{(in)} \text{ and } \mu_1^{(in)})\), and depending on the control signal \( c^{(BPPE)} \) it performs either the \( f_{(+)}(\cdot, \cdot) \) function or the \( f_{(=)}(\cdot, \cdot) \) function, i.e.

\[
\mu^{(out)} = \begin{cases} 
 f_{(=)} \left( \mu_0^{(in)}, \mu_1^{(in)} \right), & c^{(BPPE)} = 0; \\
f_{(+)} \left( \mu_0^{(in)}, \mu_1^{(in)} \right), & c^{(BPPE)} = 1.
\end{cases}
\]

Figure 17: Block diagram for the limited parallelism line decoder

Since the PE has to support the implementation of equations (47)-(49), we introduce two routing layers for the inputs (OP-MUX) and the outputs (OP-De-MUX) that ensure that the proper inputs are given to the processor and that its output is dispatched to the appropriate destination. These routing units are controlled by two control signals, \( c^{(opMux)} \) and \( c^{(opDeMux)} \), each of which has seven possible values and is therefore represented by three bits. Table 1 specifies the valid assignments of \( c^{(BPPE)} \), \( c^{(opMux)} \) and \( c^{(opDeMux)} \) for implementing the different operations. The last option \( (c^{(opMux)} = c^{(opDeMux)} = 6) \) is used in the decoder's P-Mode, which is defined in the sequel.

The proposed decoder structure is inspired by the recursive structure of the SC line decoder. Figure 18 depicts the BP line decoder block. Similarly to the SC line decoder we specify two operation modes:

- S-Mode \((modeIn = 0)\): the decoder completes a single iteration of the BP decoder, given the inputs \( \lambda(0 : N - 1) \) and \( z(0 : N - 1) \), and outputs \( \hat{u}(0 : N - 1) \) and \( \hat{x}(0 : N - 1) \) (defined in the BP signature, (52)).

- P-Mode \((modeIn = 1)\): the decoder serves as an array of \( N/2 \) processors and simultaneously performs a parallel computation on the input array \( \lambda(0 : N - 1) \) such that, \( \forall i \in [N/2]_{-} \), the output \( \lambda^{(out)}(i) \) is the outcome of applying the BP PE on inputs \( \lambda(i) \) and \( \lambda(i + N/2) \) with \( c^{(BPPE, in)} \) as the control signal.
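The P-Mode behaviour described in the list above can be sketched in a few lines of Python. The sketch is a behavioural model only, not a hardware description. It assumes that \( f_{(+)} \) is the usual LLR-domain check-node rule \( 2 \tanh^{-1}(\tanh(a/2) \tanh(b/2)) \) and that \( f_{(=)} \) is the equality-node rule \( a + b \); the document's own definitions are the ones given in its earlier equations. The function names are hypothetical.

```python
import math


def f_plus(a, b):
    """Check-node update on two LLRs (assumed definition of f_(+))."""
    t = math.tanh(a / 2) * math.tanh(b / 2)
    limit = 1 - 1e-15                       # keep atanh away from +/-1
    return 2 * math.atanh(max(-limit, min(limit, t)))


def f_eq(a, b):
    """Equality-node update on two LLRs (assumed definition of f_(=))."""
    return a + b


def bp_pe(mu0_in, mu1_in, c_bppe):
    """BP processing element: f_(=) for control value 0, f_(+) for control value 1."""
    return f_eq(mu0_in, mu1_in) if c_bppe == 0 else f_plus(mu0_in, mu1_in)


def p_mode(lam, c_bppe_in):
    """P-Mode: output i is the PE applied to lam[i] and lam[i + N/2]."""
    half = len(lam) // 2
    return [bp_pe(lam[i], lam[i + half], c_bppe_in) for i in range(half)]


if __name__ == "__main__":
    lam = [1.0, -2.0, 3.0, 4.0]
    print(p_mode(lam, 0))
    print(p_mode(lam, 1))
```

In this model all \( N/2 \) outputs are produced by independent PE evaluations, which is what allows the hardware to compute them in a single clock cycle.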

**Table 1: Routing tables for OP-MUX and OP-DEMUX in Figure 18**

<table>
<thead>
<tr>
<th>\( c^{(opMux)} = c^{(opDeMux)} \)</th>
<th>\( c^{(BPPE)} \)</th>
<th>\( \mu_0^{(in)} \)</th>
<th>\( \mu_1^{(in)} \)</th>
<th>\( \mu^{(out)} \)</th>
<th>Used in</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>0</td>
<td>\( \mu_{x_1}^{(in)} \)</td>
<td>\( \mu_{v}^{(in)} \)</td>
<td>\( \mu_{e_1 \rightarrow a_0} \)</td>
<td>Algorithm 22, step 0; Algorithm 23, step 3</td>
</tr>
<tr>
<td>1</td>
<td>1</td>
<td>\( \mu_{x_0}^{(in)} \)</td>
<td>\( \mu_{u}^{(in)} \)</td>
<td>\( \mu_{a_0 \rightarrow e_1} \)</td>
<td>Algorithm 23, step 2</td>
</tr>
<tr>
<td>2</td>
<td>1</td>
<td>\( \mu_{x_0}^{(in)} \)</td>
<td>\( \mu_{e_1 \rightarrow a_0} \)</td>
<td>\( \mu_{u}^{(out)} \)</td>
<td>Algorithm 22, step 0</td>
</tr>
<tr>
<td>3</td>
<td>0</td>
<td>\( \mu_{x_1}^{(in)} \)</td>
<td>\( \mu_{a_0 \rightarrow e_1} \)</td>
<td>\( \mu_{v}^{(out)} \)</td>
<td>Algorithm 23, step 2</td>
</tr>
<tr>
<td>4</td>
<td>1</td>
<td>\( \mu_{u}^{(in)} \)</td>
<td>\( \mu_{e_1 \rightarrow a_0} \)</td>
<td>\( \mu_{x_0}^{(out)} \)</td>
<td>Algorithm 23, step 3</td>
</tr>
<tr>
<td>5</td>
<td>0</td>
<td>\( \mu_{v}^{(in)} \)</td>
<td>\( \mu_{a_0 \rightarrow e_1} \)</td>
<td>\( \mu_{x_1}^{(out)} \)</td>
<td>Algorithm 23, step 3</td>
</tr>
<tr>
<td>6</td>
<td>\( c^{(BPPE, in)} \)</td>
<td>\( \mu_0^{(ext,in)} \)</td>
<td>\( \mu_1^{(ext,in)} \)</td>
<td>\( \lambda^{(out)} \)</td>
<td>P-Mode (Algorithm 24)</td>
</tr>
</tbody>
</table>

Figure 19: Block diagram for the BP line decoder. Details of the figure appear in Figures 20, 21 and 22, corresponding to sub-figures A, B and C, respectively.

Figure 19 contains a block diagram for this design. Due to the vast number of details in this figure, we chose to enlarge three parts of it, named sub-figures A, B and C, in Figures 20, 21 and 22, respectively. The memory plays a fundamental role in the design, as it enables storing messages within the iteration boundary and beyond it. The basic requirement is that each "butterfly" realization of the \((u + v, v)\) factor graph should have memory resources to store its messages. To allow messages to be kept within the iteration boundary, it is only required to have one registers array for each length of outer-code and for each message type. However, the need to keep a message beyond the iteration boundary requires a dedicated memory array for each outer-code instance. Note that messages whose values are calculated before being used for the first time in each iteration are not required to be kept beyond the iteration boundary. In the case of the \((u+v,v)\) code and the GCC schedule, only messages of type \(\mu_v^{(in)}\) need to be kept beyond the iteration boundary. We suggest satisfying this requirement in the following way. In the length \(N\) decoder, we associate a registers matrix \(\mu_v^{(in)}(0 : \#_r(N) - 1, 0 : N/2 - 1)\). Here, \(\#_r(N)\) is the number of realizations of factor graphs corresponding to outer-codes of size \(N\) that exist in our code.

For the \(N\) bits length code, there is only one factor graph of this size (i.e. the entire graph), and therefore for this decoder \(\#_r(N) = 1\). Consider now the \(N/2\) bits length decoder that is embedded within the \(N\) length decoder. We see in Figure 19, that this decoder has its number of realizations as \(2 \cdot \#_r(N/2)\), i.e. for the \(N\) bits length decoder we have \(\#_r(N/2) = 2\). This is because we have two outer-codes of length \(N/2\) bits in the \(N\) length code. Therefore, the memory matrix associated with it has two rows and \(N/4\) columns. The first row is dedicated to the first realization of the outer-code and the second row is dedicated to the second realization. Within this \(N/2\) bits length decoder, there is an embedded \(N/4\) length decoder with \(2 \cdot \#_r(N/4)\) realizations, so in this case \(\#_r(N/4) = 4\). As a result, it has a registers matrix with 4 rows and \(N/8\) columns (each row is dedicated to one of the 4 outer-codes of length \(N/4\) in this GCC scheme). This development continues, until we reach the embedded decoder of length 2, which, by induction, has \(\#_r(2) = N/2\) realizations for the \(N\) length decoder, so it requires a registers matrix with \(N/2\) rows and one column.
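The dimensions of the register matrices described above follow a simple pattern, which the following Python sketch tabulates. It only restates the counting argument of the text (the number of rows doubles and the number of columns halves at each recursion level); the function name is hypothetical.

```python
def mu_v_in_register_matrices(n):
    """List (decoder length, rows, columns) of the mu_v^(in) matrix at each level, N = 2**n."""
    shapes = []
    length, rows = 2**n, 1
    while length >= 2:
        shapes.append((length, rows, length // 2))   # rows = number of realizations
        length //= 2                                  # embedded decoder has half the length
        rows *= 2                                     # and twice as many realizations
    return shapes


if __name__ == "__main__":
    n = 3
    shapes = mu_v_in_register_matrices(n)
    for length, rows, cols in shapes:
        print(f"length {length}: {rows} x {cols}")
    print("total registers:", sum(rows * cols for _, rows, cols in shapes))
```

Every level stores \( N/2 \) messages, so the total is \( \frac{N}{2} \log_2(N) \) registers, which is the source of the \( \Theta(N \log N) \) memory requirement of this design.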

Figure 21: Block diagram for the BP line decoder (Figure 19) - zoom-in: Sub-figure B

Figure 22: Block diagram for the BP line decoder (Figure 19) - zoom-in: Sub-figure C

For a correct operation of the decoder, it is required to inform the embedded decoders to which realization of the outer-code's factor graph they are currently referring. This is the role of the RealizationID input signal in Figure 19, which takes integer values in \([\#_r(N)]_{-}\), and therefore requires \(\lceil \log_2(\#_r(N)) \rceil\) bits for its representation. Moving to the implementation in Figure 19, we can observe that RealizationID is indeed used to select the row of \(\mu_v^{(in)}\) corresponding to the outer-code realization that is currently processed. Furthermore, an internal signal RealizationID\(^{(N/2)}\) is defined as the RealizationID input of the embedded $N/2$ length decoder, such that

$$RealizationID^{(N/2)} = 2 \cdot RealizationID + outerCodeID, \tag{80}$$

where $outerCodeID \in \{0, 1\}$ indicates the ordinal of the $N/2$ bits length outer-code (of the current decoded length $N$ code) that is currently processed.
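The way the realization identifier propagates through the recursion can be traced with a small script. The following Python sketch applies the update rule (80) at each level and records the order in which the S-Mode recursion reaches each embedded decoder. It assumes that the recursion stops at the length 2 decoder; the function name is hypothetical.

```python
def trace_realizations(length, realization_id=0, trace=None):
    """Record (decoder length, RealizationID) pairs in S-Mode visiting order."""
    if trace is None:
        trace = []
    trace.append((length, realization_id))
    if length > 2:                                   # assumed base case: length 2
        for outer_code_id in (0, 1):                 # steps 1 and 3 of the S-Mode
            child_id = 2 * realization_id + outer_code_id   # rule (80)
            trace_realizations(length // 2, child_id, trace)
    return trace


if __name__ == "__main__":
    for length, rid in trace_realizations(8):
        print(f"length {length}, RealizationID {rid}")
```

For \( N = 8 \) the length 4 decoder is reached with identifiers 0 and 1, and the length 2 decoder with identifiers 0 to 3, so each identifier selects a distinct row of the corresponding register matrix.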

We also need to have registers arrays for the messages of type $\mu_{e_1 \rightarrow a_0}$, $\mu_{a_0 \rightarrow e_1}$, $\mu^{(in)}_u$, $\mu^{(out)}_u$ and $\mu^{(out)}_v$, each one of them of length $N/2$. We denote them by $\mu_{e_1 \rightarrow a_0}(0 : N/2 - 1)$, $\mu_{a_0 \rightarrow e_1}(0 : N/2 - 1)$, $\mu^{(in)}_u(0 : N/2 - 1)$, $\mu^{(out)}_u(0 : N/2 - 1)$ and $\mu^{(out)}_v(0 : N/2 - 1)$, respectively. Note that, as opposed to the memory structure for the $\mu^{(in)}_v$ messages, these arrays do not need to be available beyond the iteration boundary; therefore it is sufficient to have them as arrays and not matrices. Furthermore, the arrays for the messages $\mu_{e_1 \rightarrow a_0}$, $\mu^{(out)}_u$ and $\mu^{(out)}_v$ can be replaced by a single temporary array of length $N/2$. However, in the description of the hardware structure, we chose not to do this, in order to keep the discussion more comprehensible.

The routing units OP-MUX and OP-De-MUX that appeared in Figure 18a were grouped together in Figure 19 into routing arrays (M3a), (M3b), (M4a) and (M4b). The inputs and outputs to these routing arrays are arrays of inputs and outputs corresponding to the types of inputs and outputs that appear in Figure 18a. The convention is that in these routing arrays, the $i^{th}$ output corresponds to the $i^{th}$ input from each signals array (the signals array is selected by the control signal of the routing array). Moreover, the $i^{th}$ output of the OP-MUX array corresponds to the consecutive $i^{th}$ processor from the array of processors it serves. Similarly, the $i^{th}$ input of the OP-De-MUX array corresponds to the $i^{th}$ consecutive processor from the array of processors it serves.

MUX arrays (M1a), (M1b), (M2a) and (M2b) are used to select the LLR inputs to the embedded decoder, $\hat{\lambda}(0 : N/2 - 1)$. The select signal $c_m$ determines whether the inputs to the embedded decoder come from the outputs of the OP-MUX arrays (M3a) and (M3b) (if $c_m = 0$) or from the MUX arrays (M2a) and (M2b) (if $c_m = 1$). In Algorithms 22 and 23, $c_m = 1$ is used when the embedded decoder is employed in S-Mode, while $c_m = 0$ is used when it is employed in P-Mode. The multiplexer (M5) selects the appropriate source for the $c^{(BPPE)}$ control signal, such that in S-Mode ($modeIn = 0$), $c^{(BPPE)}$ takes the internal $c^{(BPPE, \text{internal})}$ signal, and in P-Mode it takes the input signal $c^{(BPPE, \text{in})}$. Finally, note that the $\lambda(0 : N - 1)$ input signals array is wired both to the $\mu^{(in)}_{x_0}(0 : N/2 - 1)$ and $\mu^{(in)}_{x_1}(0 : N/2 - 1)$ signals arrays (used in S-Mode) and to $\mu^{(ext,in)}_0(0 : N/2 - 1)$ and $\mu^{(ext,in)}_1(0 : N/2 - 1)$ (used in P-Mode). The $\mu^{(out)}_{x_0}(0 : N/2 - 1)$ and $\mu^{(out)}_{x_1}(0 : N/2 - 1)$ signals arrays are wired to the $\hat{x}(0 : N - 1)$ output signals array.

The S-Mode operation of the decoder is described in Algorithms 22 and 23. The P-Mode procedure is described in Algorithm 24.

Let us now consider the time complexity (in terms of the number of clock cycles for running an iteration) of this design. As before, let $T(n)$ be the time complexity of the decoder of length $N = 2^n$ bits polar code. We assume that each operation of the BP PE requires one clock cycle. As a consequence, we have

$$T(n) = 2 \cdot T(n - 1) + 7, \quad \text{for } n > 1 \tag{81}$$

and $T(1) = 4$, resulting in $T(n) = 5.5 \cdot N - 7 = \Theta(N)$. The memory consumption, however is $\Theta(N \cdot \log N)$, because of the memory matrices for the $\mu^{(in)}_v$ type of messages. The number of processing elements in this design is $N/2$. Note that our proposed PE can be further improved to support some PE operations occurring in parallel. For example, if the BP PE is designed such that the operation of $f_{(+)}(\cdot)$ and the operation of $f_{(=)}(\cdot)$ can be performed simultaneously in one clock cycle, we can execute the last two operations in step 3 in one clock cycle. Consequently, this will reduce the free addend in (81) to 6. Further reduction is possible if the processor can execute $f_{(+)}(\cdot)$ and direct its output to $f_{(=)}(\cdot)$ in one clock cycle. This improvement will result in joining the two operations in step 2, into one operation. Enabling the computation of $f_{(=)}(\cdot)$ and directing its output to $f_{(+)}(\cdot)$ in the same clock cycle, results in consolidation of the two operations of step 0 into one operation (actually, the latter change may also allow
Algorithm 22 S-Mode (Steps 0 and 1) of BP on Length $N$ $(u + v, v)$ Polar Code ($modeIn = 0$)

//STEP 0:
▷ Set $c_m = c^{(BPPE, internal)} = c^{(opMux)} = c^{(opDeMux)} = 0$, $mode = 1$.

Operate the embedded decoder in P-Mode, such that at the output of the decoder we have

$$
\mu_{e_1 \rightarrow a_0} (j) = f_{(=)} \left( \mu_{x_1}^{(in)} (j) , \mu_{v}^{(in)} (j) \right) \quad \forall j \in \left[\frac{N}{4}\right]_{-}.
$$

Use the auxiliary PEs array and compute

$$
\mu_{e_1 \rightarrow a_0} (j) = f_{(=)} \left( \mu_{x_1}^{(in)} (j) , \mu_{v}^{(in)} (j) \right) \quad \forall \frac{N}{4} \leq j \leq \frac{N}{2} - 1.
$$

Store these messages in their designated memory array.

▷ Set $c_m = outerCodeID = 0, c^{(BPPE, internal)} = mode = 1, c^{(opMux)} = c^{(opDeMux)} = 2$.

Simultaneously operate the embedded decoder (P-Mode) and the auxiliary array and store their outputs in the memory area such that

$$
\mu_{u}^{(out)} (j) = f_{(+)} \left( \mu_{x_0}^{(in)} (j) , \mu_{e_1 \rightarrow a_0} (j) \right) \quad \forall j \in \left[\frac{N}{2}\right]_{-}.
$$

Sample the first half of the frozen bits indicator $z$ by the $\tilde{z}$ register, i.e. $\tilde{z}(0 : N/2-1) = z(0 : N/2-1)$.

//STEP 1:
▷ Set $mode = outerCodeID = 0, c_m = 1$.

Execute the embedded decoder in S-Mode on $\mu_{u}^{(out)} (0 : N/2 - 1)$ as the LLR input and $\tilde{z}(0 : N/2 - 1)$ as the frozen symbols indicator vector. The realization ID of the embedded decoder (denoted by $realizationID^{(N/2)}$) is calculated according to (80).

▷ Sample the $\tilde{u}(0 : N/2 - 1)$ signals array by the first half of $\hat{u}$, i.e. $\hat{u}(0 : N/2 - 1) = \tilde{u}(0 : N/2 - 1)$. Sample the $\tilde{x}(0 : N/2 - 1)$ signals array by the registers array $\mu_{u}^{(in)} (0 : N/2 - 1)$, i.e. $\mu_{u}^{(in)} (0 : N/2 - 1) = \tilde{x}(0 : N/2 - 1)$.

Algorithm 23 S-Mode (Steps 2 and 3) of BP on Length $N$ $(u + v, v)$ Polar Code $(modeIn = 0)$

/\STEP 2:
\> Set $c_m = 0, c^{(BPPE_{\text{internal}})} = c^{(\text{opMux})} = c^{(\text{opDeMux})} = \text{mode} = 1$.
Simultaneously operate the embedded decoder (P-Mode) and the auxiliary array and store their outputs in the memory area such that

$$\mu_{a_0 \rightarrow e_1}(j) = f_{(+)} \left( \mu_{x_0}^{(in)}(j), \mu_{v}^{(in)}(j) \right) \quad \forall j \in [N/2]_{-}.$$  

\> Set $c_m = c^{(BPPE_{\text{internal}})} = 0, \text{mode} = \text{outerCodeID} = 1, c^{(\text{opMux})} = c^{(\text{opDeMux})} = 3$.
Simultaneously operate the embedded decoder (P-Mode) and the auxiliary array and store their outputs in the memory area such that

$$\mu_{v}^{(out)}(j) = f_{(-)} \left( \mu_{x_1}^{(in)}(j), \mu_{a_0 \rightarrow e_1}(j) \right) \quad \forall j \in [N/2]_{-}.$$  

Sample the second half of the frozen bits indicator $z$ by the $\hat{z}$ register, i.e. $\hat{z}(0 : N/2 - 1) = z(N/2 : N - 1)$.

//STEP 3:
▷ Set $c_m = \text{mode} = 0, \text{outerCodeID} = 1$.
Execute the embedded decoder in S-Mode on $\mu_{v}^{(out)}(0 : N/2 - 1)$ as the LLR input and $\hat{z}(0 : N/2 - 1)$ as the frozen symbols indicator vector. The realization ID of the embedded decoder (denoted by $\text{realizationID}^{(N/2)}$) is calculated according to (80).

▷ Sample the $\hat{u}(0 : N/2 - 1)$ signals array by the second half of $\hat{u}$, i.e. $\hat{u}(N/2 : N - 1) = \hat{u}(0 : N/2 - 1)$. Sample the $\hat{x}(0 : N/2 - 1)$ signals array by the registers array $\mu_{v}^{(in)}(0 : N/2 - 1)$, i.e. $\mu_{v}^{(in)}(0 : N/2 - 1) = \hat{x}(0 : N/2 - 1)$.

▷ Set $c_m = c^{(BPPE_{\text{internal}})} = c^{(\text{opMux})} = c^{(\text{opDeMux})} = 0, \text{mode} = 1$.
Simultaneously operate the embedded decoder (P-Mode) and the auxiliary array and store their outputs in the memory area such that

$$\mu_{e_1 \rightarrow a_0}(j) = f_{(-)} \left( \mu_{x_1}^{(in)}(j), \mu_{v}^{(in)}(j) \right) \quad \forall j \in [N/2]_{-}.$$  

▷ Set $c_m = 0, c^{(BPPE_{\text{internal}})} = \text{mode} = 1, c^{(\text{opMux})} = c^{(\text{opDeMux})} = 4$.
Simultaneously operate the embedded decoder (P-Mode) and the auxiliary array and store their outputs in the memory area such that

$$\mu_{x_0}^{(out)}(j) = f_{(+)} \left( \mu_{u}^{(in)}(j), \mu_{e_1 \rightarrow a_0}(j) \right) \quad \forall j \in [N/2]_{-}.$$

▷ Set $c_m = c^{(BPPE_{\text{internal}})} = 0, \text{mode} = 1, c^{(\text{opMux})} = c^{(\text{opDeMux})} = 5$.
Simultaneously operate the embedded decoder (P-Mode) and the auxiliary array and store their outputs in the memory area such that

$$\mu_{x_1}^{(out)}(j) = f_{(-)} \left( \mu_{v}^{(in)}(j), \mu_{a_0 \rightarrow e_1}(j) \right) \quad \forall j \in [N/2]_{-}.$$  

//Note that the $\mu_{x_0}$ and $\mu_{x_1}$ signals arrays are wired to the $\hat{x}(0 : N - 1)$ output signals array, as specified in Figure 18a.
Algorithm 24 P-Mode of the BP Line Decoder of Length $N$ $(u + v, v)$ Polar Code ($\text{modeIn} = 1$)

Simultaneously operate the embedded decoder (P-Mode) and the auxiliary array such that we have at the output of the decoder

$$\lambda^{(\text{out})}(j) = \begin{cases} f_{(-)}(\lambda(j), \lambda(j + N/2)), & c^{(\text{BPPE,in})} = 0; \\ f_{(+)}(\lambda(j), \lambda(j + N/2)), & c^{(\text{BPPE,in})} = 1. \end{cases} \quad \forall j \in [N/2]_{-}.$$
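The P-Mode operation of Algorithm 24 can be illustrated by a small software model. This is only a sketch under stated assumptions: the definitions of `f_plus` and `f_minus` below are the standard check-node and equality-node LLR rules, which are assumed here rather than taken from this section, and the function names and the clamping constant are illustrative.

```python
import math


def f_plus(a, b):
    """Check-node LLR combination (assumed standard form of f_(+))."""
    p = math.tanh(a / 2.0) * math.tanh(b / 2.0)
    # Clamp to stay inside the domain of atanh for very large LLRs.
    limit = 1.0 - 1e-15
    p = max(-limit, min(limit, p))
    return 2.0 * math.atanh(p)


def f_minus(a, b):
    """Equality-node LLR combination (assumed standard form of f_(-) for BP)."""
    return a + b


def p_mode(llr, c_bppe_in):
    """Model of Algorithm 24: pair entry j with entry j + N/2.

    c_bppe_in == 0 selects f_(-), c_bppe_in == 1 selects f_(+).
    Returns the N/2 output LLRs.
    """
    n = len(llr)
    if n == 0 or n % 2 != 0:
        raise ValueError("the LLR array must have even, nonzero length")
    if c_bppe_in not in (0, 1):
        raise ValueError("the control signal must be 0 or 1")
    half = n // 2
    op = f_minus if c_bppe_in == 0 else f_plus
    return [op(llr[j], llr[j + half]) for j in range(half)]
```

In hardware the $N/2$ outputs are produced simultaneously by the PEs; the list comprehension above only models the input-output relation, not the timing.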

to consolidate the second and third computations in step 3, rendering our first suggested change obsolete. These changes result in 4 as the free addend in (81) and $T(2) = 2$, so $T(N) = 3 \cdot N - 4$.
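The closed form can be checked numerically. The sketch below assumes that (81) has the form $T(N) = 2 \cdot T(N/2) + c$, where $c$ is the free addend (two S-Mode calls to the embedded decoder plus a constant); this form is an assumption, since (81) is not reproduced here, and the function name is illustrative.

```python
def s_mode_time(n, addend=4, base=2):
    """Evaluate T(N) = 2*T(N/2) + addend with T(2) = base (assumed form of (81))."""
    if n < 2 or n & (n - 1):
        raise ValueError("N must be a power of two, at least 2")
    if n == 2:
        return base
    return 2 * s_mode_time(n // 2, addend, base) + addend


# With addend 4 and T(2) = 2 the recursion matches 3*N - 4.
for n in (2, 4, 8, 64, 1024):
    assert s_mode_time(n) == 3 * n - 4
```

Unrolling the same recursion by hand gives $T(N) = (N/2) \cdot T(2) + 4 \cdot (N/2 - 1) = 3 \cdot N - 4$.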

The remarks raised on the recursive design of the SC line decoder at the end of Subsection 5.1.3 also apply here. Specifically, this design also suffers from long-path hazards, especially in the routing layers of P-Mode. Consequently, more efficient designs may be obtained by unfolding the recursive blocks. Furthermore, idle clock cycles of the BP PEs are also a problem in this design, and the solutions of Subsections 5.1.4 and 5.1.5 may be adapted to this decoder too.

Note, however, that while in the SC decoder the existence of inactive PEs is due to the properties of the SC algorithm, which dictates the scheduling of the message computations, in the BP case it is due to the schedule we chose and is not a mandatory property of the algorithm. Other schedules exist, and currently there is no evidence as to which schedule is better (for example, in terms of the achieved error rate or the average number of iterations required for convergence). Hussami et al. [12] proposed the Z-shape schedule, whose description suggests a constant level of parallelism of $N$ PEs (of the type we considered here) operating all the time. This seems to give the Z-shape schedule an advantage over the GCC schedule if the number of processors is not limited (unless the technique of Subsection 5.1.4 is applied). Which schedule is better when the number of processors is limited is an interesting question and a matter for further research.

5.2 Decoding Architectures for General Polar Codes

Thus far, we described decoding algorithms for the $(u + v, v)$ polar code. This notion has enabled us to restate the SC implementations for Arikan’s construction that were proposed by Leroux et al. [15]. In addition, we suggested a BP decoding implementation employing the GCC schedule. In this subsection, we generalize these constructions to other types of polar codes. Since we already covered implementations for Arikan’s code in some detail, in this subsection we provide a more concise description of the implementations, mainly emphasizing the principal differences from the designs in Subsection 5.1.

5.2.1 Recursive Description of the SC Line Decoder for General Linear Kernels

Let $C$ be a homogeneous linear polar code over a field $F$, constructed by a kernel of $\ell$ dimensions. This kernel has an $\ell \times \ell$ generating matrix $G$ associated with it. Let $f$ be the number of bits required to represent a field element, i.e. $f = \lceil \log_2 |F| \rceil$.

Figure 23: Block definitions of the SC line decoder for a length $N$ polar code based on a linear kernel of $\ell$ dimensions with alphabet $F$.

Figure 23a depicts the basic processing element (PE) of the SC line decoder. The LLR input $\lambda(0 : \ell - 1)$ and output $\lambda^{(\text{out})}$ are specified such that each entry $\lambda(j)$ is actually a vector of $|F| - 1$ elements. Each element is the logarithm of the likelihood ratio between the zero symbol and one of the symbols of $F \setminus \{0\}$ (denoted by $\lambda(t)$ in (20), where $t \in F \setminus \{0\}$). In our block diagrams, thick lines are used to carry these LLR signals. In other words, assuming that each aforementioned $\lambda(t)$ is represented by $\beta$ bits, each thick line is composed of $\beta \cdot (|F| - 1)$ bit lines. The input signal $c_u$ has $\ell$ possible values, each corresponding to a different LLR processing step as specified in (20). The input signals array $\hat{x}^{(\text{in})}(0 : \ell f - 1)$ represents a coset vector for the currently processed kernel. This is a word of length $\ell$ over $F$, and as such it is represented by $\ell \cdot f$ bits. Let $\hat{x} \in F^\ell$ be the vector represented by this register array. Furthermore, let $\left[\lambda_i(t)\right]_{t \in F \setminus \{0\}}$ be the LLR vector corresponding to the signal input $\lambda(i)$, where $i \in [\ell]$. Similarly, let $\left[\hat{\lambda}_i(t)\right]_{t \in F \setminus \{0\}}$ be the LLR vector corresponding to the output signal $\lambda^{(\text{out})}$. If the $c_u$ input represents the decimal value $i$ (denote it by $c_u \equiv i$), the circuit’s output is defined by Equation (26).
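The signal widths of the PE follow directly from $|F|$, $\ell$ and $\beta$. The following sketch computes them; the function and key names are illustrative, and the width $\lceil \log_2 \ell \rceil$ of the $c_u$ control input is an assumption based on it having $\ell$ possible values.

```python
def pe_signal_widths(field_size, ell, beta):
    """Bit widths of the PE signals for alphabet size |F|, kernel size ell
    and beta bits per scalar LLR.

    (q - 1).bit_length() equals ceil(log2(q)) for every integer q >= 1.
    """
    if field_size < 2 or ell < 2 or beta < 1:
        raise ValueError("need |F| >= 2, ell >= 2 and beta >= 1")
    f = (field_size - 1).bit_length()        # bits per field symbol
    return {
        "symbol_bits": f,
        "llr_line_bits": beta * (field_size - 1),  # one thick line
        "coset_bits": ell * f,                     # x_hat^(in) register array
        "c_u_bits": (ell - 1).bit_length(),        # assumed control width
    }


# Example: a quaternary alphabet with a kernel of 4 dimensions and 6-bit LLRs.
widths = pe_signal_widths(field_size=4, ell=4, beta=6)
```

For the binary $(u + v, v)$ kernel the same function gives one bit per symbol and a single $\beta$-bit LLR per line, which agrees with the designs of Subsection 5.1.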

Figure 23b specifies the block definition of this general-kernel line decoder. Most of the labels of this block’s input and output signals are the same as in Figure 14b, and they keep their functionality as well. Some modifications, however, are required in order to support the change in the kernel and the alphabet. The signals arrays $\hat{x}^{(\text{in})}$, $\hat{u}$ and $\hat{x}$ represent vectors of length $N$ over the alphabet $F$. As a consequence, each of their entries is represented by $f$ bits. The input signal $c_u^{(\text{in})}(0 : \lceil \log_2 \ell \rceil - 1)$, used in P-Mode ($modeIn = 1$), has $\ell$ possible values, each corresponding to a different LLR processing step as specified in (24) in Algorithm 8. Since the maximum number of PEs employed simultaneously is $N/\ell$, the line decoder is designed to have an LLR output signals array of length $N/\ell$. In P-Mode, for all $j \in [N/\ell]$, $\lambda^{(\text{out})}(j)$ is the output of a PE that is given as inputs the LLR array $\lambda(j \cdot \ell : (j + 1) \cdot \ell - 1)$, the coset vector $\hat{x}^{(\text{in})}(j \cdot \ell : (j + 1) \cdot \ell - 1)$ and $c_u = c_u^{(\text{in})}$.
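The P-Mode behaviour described above amounts to slicing the inputs into $N/\ell$ consecutive groups and feeding each group to one PE. A minimal sketch follows, in which `pe` is a hypothetical callable standing in for the PE of Figure 23a (its actual computation is defined by Equation (26) and is not modelled here); symbols are indexed by position rather than by bit.

```python
def p_mode_general(llr, x_in, c_u_in, ell, pe):
    """Model of P-Mode for a kernel of ell dimensions.

    llr    : sequence of N LLR entries (each entry may itself be a vector)
    x_in   : sequence of N coset symbols over F
    c_u_in : processing-step selector, one of 0 .. ell - 1
    pe     : hypothetical PE model, called as pe(llr_group, coset_group, c_u)
    Returns the N/ell PE outputs.
    """
    n = len(llr)
    if n == 0 or n % ell != 0 or len(x_in) != n:
        raise ValueError("inputs must have equal length, a multiple of ell")
    if not 0 <= c_u_in < ell:
        raise ValueError("c_u_in must lie in 0 .. ell - 1")
    return [
        pe(llr[j * ell:(j + 1) * ell], x_in[j * ell:(j + 1) * ell], c_u_in)
        for j in range(n // ell)
    ]


# Usage with a toy PE that only echoes what it received.
echo_pe = lambda group, coset, c_u: (tuple(group), tuple(coset), c_u)
outputs = p_mode_general(list(range(6)), list("abcdef"), 1, 3, echo_pe)
```

Each list element corresponds to one of the $N/\ell$ PEs that operate simultaneously in the hardware.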

In S-Mode ($modeIn = 0$) the decoder outputs its estimations for the information word $\hat{u}(0 : Nf - 1)$ and its corresponding codeword $\hat{x}(0 : Nf - 1)$ given the LLR input signals array $\lambda(0 : N - 1)$ and the frozen indicator vector $z(0 : N - 1)$.

The generalization of the $(u + v, v)$ block diagram in Figure 17 and its corresponding algorithms can be easily completed using the above PE description and Algorithms 8 and 9. We leave the details to the reader.

5.2.2 Recursive Description of the BP Line Decoder for General Kernels

Subsection 5.2.1 considered the adaptation of the $(u + v, v)$ line decoder to general kernels. Designing a BP line decoder for general polar codes entails similar difficulties. In this subsection we only highlight the principal modifications to the BP decoder of Subsection 5.1.6 that are necessary in order to adjust it to the case of a kernel of $\ell$ dimensions over alphabet $F$.

- The LLR inputs, internal signals and memories should be extended to support LLRs over $F$. See Subsection 5.2.1 for more details.

- The routing layers OP-MUX and OP-De-MUX need to be extended in order to support all the different messages calculated by the PE.

- The Memory Region in Figure 19 needs to include a registers array for each of the algorithm’s possible messages. Messages that must be kept beyond the iteration boundary have to be stored in a matrix in which each row corresponds to a different realization of the code. The number of LLRs in each row of these matrices is \( N/\ell \), the outer-code length in \( F \) symbols. On the other hand, messages whose values are calculated in each iteration before being used for the first time (in that iteration) require only registers arrays of length \( N/\ell \). See Subsection 5.1.6 for more details on the distinction between these two types of messages.

- Algorithms 22 and 23 are replaced by \( \ell \) pairs of steps, each pair dedicated to a different outer-code \( C_i \). See Algorithms 17 and 18 for further details.

5.2.3 Decoders for Mixed-Kernels and General Concatenated Codes

So far, we considered decoders for homogeneous polar codes over alphabet \( F \). These codes have the attractive property that the outer-codes in their GCC structure are themselves (shorter) polar codes from the same family. Therefore, we were able to use a single embedded decoder of a code of length \( N/\ell \) symbols within the decoder of the code of length \( N \) symbols. This embedded decoder is used \( \ell \) times, each time on different inputs (i.e. indices of the frozen symbols and the input messages). Unfortunately, this property no longer applies when mixed-kernels polar codes are used.

Let us consider the mixed-kernels polar code of \( \ell = 4 \) dimensions described in Example 8. In the decoder for a code of length \( N = 4^n \) bits, we need an embedded decoder for the mixed-kernels code of length \( N/4 \) bits and an additional embedded decoder for the RS4 polar code of length \( N/4 \) quaternary symbols. Note, however, that even here a reuse of circuits is still possible, as the decoder for the RS4 code of length \( N/4 \) requires an embedded decoder for the RS4 code of length \( N/16 \) within it. The latter decoder (and its embedded decoders) can be shared with the decoder for the mixed-kernels code of length \( N/4 \) (which requires an embedded RS4 decoder of the same length).
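The effect of this sharing can be illustrated by counting embedded decoders. The sketch below is a bookkeeping model only, under the following assumptions: a mixed-kernels decoder embeds one mixed-kernels decoder and one RS4 decoder of a quarter of its length, an RS4 decoder embeds one RS4 decoder of a quarter of its length, and decoders of length 1 are trivial and not counted. The labels `"mixed"` and `"rs4"` and the function names are illustrative; mixed-kernels lengths are in bits and RS4 lengths in quaternary symbols.

```python
def embedded(kind, length, ell=4):
    """Decoders directly embedded in a decoder of the given kind and length."""
    sub = length // ell
    if sub <= 1:              # assumed base case: nothing worth instantiating
        return []
    if kind == "mixed":
        return [("mixed", sub), ("rs4", sub)]
    return [("rs4", sub)]


def count_embedded(n, ell=4):
    """Return (distinct decoders, instances without sharing) for length n."""
    distinct, instances = set(), 0
    stack = [("mixed", n)]
    while stack:
        kind, length = stack.pop()
        for child in embedded(kind, length, ell):
            instances += 1
            distinct.add(child)
            stack.append(child)
    return len(distinct), instances


shared, unshared = count_embedded(256)
```

Under these assumptions a code of length $N = 4^n$ needs two distinct embedded decoders per recursion level when circuits are shared, while a design without sharing instantiates the RS4 decoders of each length more than once.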

Summary and Conclusions

We considered the recursive GCC structures of polar codes which led to recursive descriptions of their encoding and decoding algorithms. Specifically, known algorithms (SC, SCL and BP) were formalized in a recursive fashion, and then were generalized for arbitrary kernels. Moreover, recursive architectures for these algorithms were considered. We restated known architectures, and generalized them for arbitrary kernels.

In our discussion, we preferred, for brevity, to give somewhat abstract descriptions of the subjects, emphasizing the main properties while neglecting some of the technical details. However, a complete design requires a full treatment of all of these specifics (see e.g. Leroux et al. for the \((u + v, v)\) case [15]).

A subject that requires more careful attention is the study of the BP decoder, and specifically of the proposed GCC schedule. A comparison between this schedule and other proposed schedules (e.g. the Z-shape schedule) is an intriguing question. Furthermore, a comparison of the BP decoder with the SCL decoder for general kernels, taking into account error-correction performance and decoder complexity, is also an interesting topic. These questions are subjects for further research.

References


