Assume a Quantum Data Set

Data-processing algorithms often require that the data is prepared in appropriate structures that are readily accessible or can be prepared on demand. Quantum computers derive their power from storing and manipulating quantum superpositions and could potentially speed up data science tasks. However, they often require input in the form of a quantum state that encodes a nonquantum data set. Here we describe some of the challenges of encoding nonquantum data for use by quantum computers.


Initialization for quantum algorithms
Quantum computing promises to solve problems that are infeasible for traditional computers. Apart from the simulation of quantum systems, such as in chemistry or material science, there has been a surge in quantum linear algebra algorithms suitable for high-dimensional problems. These algorithms include linear system solvers, regression or machine learning algorithms that have the potential to perform otherwise impossible data science tasks. These otherwise-impossible tasks would likely involve extraordinarily large data sets in which the superior asymptotic complexity scaling of quantum algorithms can prevail over highly optimized supercomputer code.
It is important to emphasize that the 'superior asymptotic complexity scaling' to which we and other quantum computer scientists refer assesses only the complexity of processing data. Our aim in this commentary is to elucidate the oft-neglected complexity of encoding data into an appropriate format for quantum processing.
We expect the quantum computer to achieve an advantage over classical computers by employing a 'quantum' data encoding, meaning that the data would be presented in some kind of quantum superposition. The quantum computer can then take advantage of entanglement and superposition when processing the data, instead of processing them bit by bit as classical computers do. The data would therefore be presented as a quantum state, which cannot be copied and which must be measured in order to retrieve classical information, a step that collapses the superposition. Published quantum algorithms usually assume that the data is accessible in the form required by the quantum algorithm. One might assume that a quantum programmer has access to quantum data from a cloud; however, interacting with such data sets could create entanglement between the programmer's quantum computer and the cloud. Alternatively, the quantum data scientist would have access only to a classical database, and it would be up to them to transform the classical data into quantum states of an appropriate form.
For clarity's sake, we describe in detail two common methods of quantum data encoding: bit encoding and amplitude encoding (following the terminology of Wiebe, 2020). In both cases, we consider the task of encoding a classical data set $X = \{x_1, x_2, \ldots, x_N\}$, where each $x_\ell$ encodes an $M$-dimensional data point. Focusing on 'big data', we assume that at least one of $M$ or $N$ is very large.

Bit encoding
Suppose that each data point $x_\ell$ ($\ell = 1, 2, \ldots, N$) can be represented by the unique bitstring $\vec{b}_\ell = b_{\ell,1} b_{\ell,2} \cdots b_{\ell,K}$, where each $b_{\ell,k} \in \{0, 1\}$ and the integer $K$ depends on the kind of data being encoded. If, for example, each $M$-dimensional data point is a length-$M$ array of 32-bit floating point numbers and we have not compressed the data in any way, then $K = 32M$. Then we can directly encode the data point $x_\ell$ in a quantum register consisting of $K$ qubits: $|\vec{b}_\ell\rangle = |b_{\ell,1}\rangle |b_{\ell,2}\rangle \cdots |b_{\ell,K}\rangle$. Such encoding can be accomplished for a given data point by first preparing the quantum register in the all-zero state $|0\rangle |0\rangle \cdots |0\rangle$ and then applying bit-flip operations to the qubits whose bits equal 1. However, nothing about this approach is yet 'quantum': the strategy can be understood in purely classical terms. This approach becomes quantum if we choose to encode the data set as a superposition of such bit-encoded data points; for example, $\frac{1}{\sqrt{N}} \sum_{\ell=1}^{N} |\vec{b}_\ell\rangle$, where the coefficient $1/\sqrt{N}$ is present to enforce the technical requirement of state normalization. We refer the reader to chapter 4 of Schuld and Petruccione (2021) for details.
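As a small illustration of the bit-encoded superposition described above, the following Python sketch builds the amplitude vector of such a state with NumPy. The function name `bit_encode` and the three 4-bit data points are hypothetical choices made for this example. The sketch is a classical simulation that stores all 2^K amplitudes, so it is only practical for very small K and says nothing about the cost of preparing the state on quantum hardware.

```python
import numpy as np

def bit_encode(bitstrings):
    """Return amplitudes of (1/sqrt(N)) * sum_l |b_l> on K qubits.

    `bitstrings` is a list of N distinct strings of '0'/'1', all of length K.
    (Hypothetical helper, written only for this illustration.)
    """
    n_points = len(bitstrings)
    n_qubits = len(bitstrings[0])
    if len(set(bitstrings)) != n_points:
        raise ValueError("bitstrings must be unique for the state to be normalized")
    state = np.zeros(2 ** n_qubits)
    for bits in bitstrings:
        if len(bits) != n_qubits:
            raise ValueError("all bitstrings must have the same length K")
        # Basis state |b_1 b_2 ... b_K> sits at the index obtained by reading
        # the bitstring as a binary integer.
        state[int(bits, 2)] = 1.0 / np.sqrt(n_points)
    return state

state = bit_encode(["0101", "0011", "1110"])
print(np.flatnonzero(state))   # indices of the occupied basis states
print(np.linalg.norm(state))   # should be 1 up to rounding
```

Only N of the 2^K amplitudes are nonzero, which is why the uniqueness of the bitstrings matters: repeated bitstrings would make the coefficient 1/sqrt(N) the wrong normalization.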
The bit encoding method would play an important role in data science techniques that depend on Grover's search algorithm (Grover, 1996), or closely related techniques, because these require a black-box quantum subroutine. That is, the user of Grover's algorithm is expected to construct a quantum program that assigns to each possible input $\ell$ a mark indicating to the search algorithm whether or not that item is the one being searched for: the program should perform $|\ell\rangle |0\rangle \to |\ell\rangle |\mathrm{mark}(\ell)\rangle$. One could imagine programming some function that assigns a mark to each possible data point (i.e., performs $|\vec{b}\rangle |0\rangle \to |\vec{b}\rangle |\mathrm{mark}(\vec{b})\rangle$ for any bitstring $\vec{b}$), but this would be an incomplete solution. We also need a data encoding routine of the form $|\ell\rangle |\vec{0}\rangle \to |\ell\rangle |\vec{b}_\ell\rangle$, where $\vec{0}$ refers to the length-$K$ bitstring consisting entirely of zeroes. Then we could combine the two into the sort of black-box quantum subroutine required by Grover search.
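The composition of the two routines can be sketched classically by tracking what happens to a single computational basis state. In the Python sketch below, `make_oracle`, `data_bits`, and `mark` are hypothetical names, indices start at 0 rather than 1, and the marking rule (exactly three bits set) is an arbitrary assumption. A real quantum subroutine would act on superpositions of indices; this sketch only shows how the data-loading step and the marking step chain together.

```python
def make_oracle(data_bits, mark):
    """Combine a data-loading step with a marking function (illustrative only).

    data_bits: list whose entry l is the bitstring b_l of data point l
    mark:      function mapping a bitstring to 0 or 1
    """
    def oracle(index, flag=0):
        bits = data_bits[index]          # |l>|0...0> -> |l>|b_l>
        flag ^= mark(bits)               # XOR mark(b_l) into the flag bit
        # The data register would then be uncomputed, leaving |l>|mark(l)>.
        return index, flag
    return oracle

data_bits = ["0101", "0011", "1110"]
mark = lambda bits: int(bits.count("1") == 3)   # hypothetical search criterion
oracle = make_oracle(data_bits, mark)
print([oracle(l) for l in range(len(data_bits))])
```

The point of the sketch is that every call to the oracle includes a data lookup, so the cost of that lookup is paid on each use of the black box.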
The bit encoding method then arises because Grover's search strategy involves preparing superpositions like $\frac{1}{\sqrt{N}} \sum_{\ell=1}^{N} |\ell\rangle$ using $O(\log N)$ quantum operations. The user-provided quantum subroutine would then be used once to create a superposition of the form $\frac{1}{\sqrt{N}} \sum_{\ell=1}^{N} |\ell\rangle |\mathrm{mark}(\ell)\rangle$, at which point Grover's algorithm prescribes a technique to boost the amplitude of marked items at the expense of unmarked items with $O(\sqrt{N})$ uses of the black-box quantum subroutine that, to repeat, must be supplied by the user. By contrast, nonquantum computing requires $O(N)$ calls to the corresponding subroutine. This is an enticing potential speedup, but one must remember that the overall cost of the algorithm depends heavily on the computational cost of the user-provided quantum subroutine, which depends in a potentially complicated way on the parameter $K$ and may turn out to dominate the computational complexity. It is this overall cost that must be assessed when evaluating quantum versus classical approaches. We caution data scientists against the tempting habit of ignoring the cost of the user-provided subroutine and counting only the number of uses of that subroutine.
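To make the query counting concrete, here is a minimal classical simulation of Grover's amplitude boosting for a single marked item. It is a sketch under stated assumptions: the oracle is used in its phase-flip form, a standard equivalent of the bit-marking form described above, and the function name `grover_search`, the choice N = 64, and the marked index 37 are hypothetical. The simulation stores all N amplitudes, so it offers no speedup itself; it only counts how many oracle uses the quantum procedure would need.

```python
import math
import numpy as np

def grover_search(marked, n_items):
    """Simulate Grover search over indices 0..n_items-1 with one marked item.

    Returns (probabilities, oracle_calls).
    """
    # Uniform superposition (1/sqrt(N)) * sum_l |l>.
    state = np.full(n_items, 1.0 / math.sqrt(n_items))
    n_iterations = math.floor(math.pi / 4 * math.sqrt(n_items))
    oracle_calls = 0
    for _ in range(n_iterations):
        state[marked] *= -1.0               # oracle: flip the sign of the marked item
        oracle_calls += 1
        state = 2.0 * state.mean() - state  # diffusion: inversion about the mean
    return state ** 2, oracle_calls

probs, calls = grover_search(marked=37, n_items=64)
print("oracle calls:", calls, "versus up to", 64, "classical lookups")
print("probability of measuring the marked item:", probs[37])
```

The count returned here is only the number of oracle uses. The overall cost is that count multiplied by the cost of one oracle call, which includes the data encoding and depends on K.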

Amplitude encoding
In contrast with bit encoding, the amplitude encoding method seeks to encode each individual data point in the amplitudes of a quantum superposition, rather than directly in the quantum register. In this case, we assume each data point $x_\ell$ can be represented with a length-$M$ vector $\vec{v}_\ell = (v_{\ell,1}, v_{\ell,2}, \ldots, v_{\ell,M})$ that, for technical reasons, we further assume to be normalized ($\vec{v}_\ell \cdot \vec{v}_\ell = 1$) and nonnegative ($v_{\ell,k} \geq 0$ for each $k, \ell$); see Schuld and Petruccione (2021) for a detailed discussion. The amplitude encoding can encode each data point in quantum superposition using a procedure like $x_\ell \mapsto |\vec{v}_\ell\rangle = \sum_{k=1}^{M} v_{\ell,k} |k\rangle$. The amplitude encoding is then analogous to storing the vector $\vec{v}_\ell$ by creating an $M$-sided die such that, when rolling that $M$-sided die, the probability of side $k$ showing is equal to the square of $v_{\ell,k}$. In other words, we perform readout of the data $x_\ell$ from the $M$-sided die by (1) rolling the die many times, (2) recording the relative frequencies $f_k$ of each outcome $k$, and (3) calculating the square root $\sqrt{f_k} \approx v_{\ell,k}$; the more we roll the die, the more accurate the readout. Note that step (3) is necessary because a quantum amplitude is related to the square root of a probability.
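The die-rolling analogy can be tried out directly. The following Python sketch is an illustration under stated assumptions: `amplitude_encode` and `readout` are hypothetical helper names, the example vector (3, 4) is arbitrary, and the normalization step is added here for convenience, whereas the text assumes the data arrive already normalized. The rolls are simulated with NumPy's multinomial sampler.

```python
import numpy as np

def amplitude_encode(values):
    """Return the amplitudes v_k of a nonnegative vector, normalized to unit length."""
    v = np.asarray(values, dtype=float)
    if np.any(v < 0):
        raise ValueError("this sketch assumes nonnegative entries")
    norm = np.linalg.norm(v)
    if norm == 0:
        raise ValueError("cannot normalize the zero vector")
    return v / norm

def readout(amplitudes, n_rolls, rng):
    """Estimate the amplitudes by 'rolling the M-sided die' n_rolls times."""
    probs = amplitudes ** 2
    probs = probs / probs.sum()               # guard against rounding error
    counts = rng.multinomial(n_rolls, probs)  # (1) roll the die many times
    freqs = counts / n_rolls                  # (2) relative frequencies f_k
    return np.sqrt(freqs)                     # (3) sqrt(f_k) approximates v_k

rng = np.random.default_rng(0)
amplitudes = amplitude_encode([3.0, 4.0])
for n_rolls in (100, 10_000, 1_000_000):
    print(n_rolls, readout(amplitudes, n_rolls, rng))
```

The estimates approach the encoded amplitudes as the number of rolls grows, which illustrates why reading data back out of an amplitude encoding requires many repetitions.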
One could then encode the entire data set as a 'superposition of superpositions': $\frac{1}{\sqrt{N}} \sum_{\ell=1}^{N} |\ell\rangle |\vec{v}_\ell\rangle = \frac{1}{\sqrt{N}} \sum_{\ell=1}^{N} \sum_{k=1}^{M} v_{\ell,k} |\ell\rangle |k\rangle$. Although this strange encoding has important limitations, it has one powerful feature: the number of qubits needed to store this state is $O(\log(NM)) = O(\log N + \log M)$ because we need only $O(\log N)$ qubits to store the index $\ell$ and $O(\log M)$ qubits to store the index $k$. This contrasts with the $O(MN)$ space cost one would expect for storing such a data set in a classical register. The exponential improvement to space cost underlies the potentially exponential gains from applications such as quantum linear system solvers (Harrow et al., 2009) and quantum data fitting algorithms (Wiebe et al., 2012). However, defining the encoding could be quite computationally difficult (see, e.g., Aharonov and Ta-Shma, 2007) and thereby wipe out any advantage to using the quantum computer.
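The space comparison can be checked with a few lines of Python. In this sketch the names `index_qubits` and `encode_data_set` are hypothetical, each row is normalized inside the function for convenience, and the flattened ordering (amplitude of index ℓ and component k stored at position ℓM + k, both counted from 0) maps cleanly onto separate qubit registers only when M is a power of two. As before, the simulation stores all NM amplitudes classically and is meant only to illustrate the bookkeeping.

```python
import numpy as np

def index_qubits(size):
    """Qubits needed to label `size` distinct values: ceil(log2(size))."""
    return (size - 1).bit_length()

def encode_data_set(rows):
    """Amplitudes of (1/sqrt(N)) * sum_l |l>|v_l>, flattened as l*M + k."""
    v = np.asarray(rows, dtype=float)
    v = v / np.linalg.norm(v, axis=1, keepdims=True)   # normalize each data point
    return v.reshape(-1) / np.sqrt(v.shape[0])

rows = [[3, 4], [1, 0], [1, 1], [0, 2]]                # N = 4 points, M = 2
state = encode_data_set(rows)
print("amplitudes stored classically:", state.size)
print("qubits needed:", index_qubits(4) + index_qubits(2))
print("norm:", np.linalg.norm(state))

# Scaling for a larger, hypothetical data set: N = 10**6 points, M = 10**3.
print("qubits:", index_qubits(10**6) + index_qubits(10**3),
      "versus classical entries:", 10**6 * 10**3)
```

The logarithmic qubit count is the source of the apparent exponential saving, but it says nothing about how many operations are needed to prepare the state.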

Conclusion
The implication for data scientists is that there are enticing quantum speedups to be investigated, but it is not easy to translate those potential speedups into actual improvements for practical problems. This challenge is not unique to data science and the computational cost of data encoding has been raised in the context of quantum machine learning in the work of Aaronson (2015) and Wiebe (2020).
The proper choice of quantum data encoding methodology depends on the kind of data being analyzed and the kind of algorithm to be applied. We should expect that the quantum data encoding methodology has meaningful and potentially detrimental effects on the overall efficiency of the algorithm, and that the complexity analysis of that quantum data encoding methodology is a potentially challenging intellectual exercise.
The difficulty of encoding data for quantum processing indicates a larger need to consider the management of quantum memory. While the immediate development of quantum computers focuses on perfecting and scaling the quantum processing unit (QPU), quantum memory management could soon become an important and distinct research area. Researchers are already starting to consider the role of quantum RAM (Arunachalam et al., 2015; Giovannetti et al., 2008) as well as quantum ROM (Berry et al., 2019; Low et al., 2018) within the analysis of quantum computer applications.
Our key message is that it is necessary to consider the complete quantum algorithm (input, data processing, and output) in order to compare it with its classical counterpart. Since quantum information cannot be copied and is destroyed by measurement, the complexity of input is a pervasive cost that needs to be factored into the computation. We therefore advise data scientists to temper their excitement about the promise of quantum algorithms by heeding the challenge of presenting data to the quantum computer.
Disclosure Statement. The authors have no conflicts of interest to declare.