Hibernate Search / HSEARCH-2886

Use of BufferedWriter in GsonEntity may lead to MalformedInputException when input contains 4-byte unicode characters

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 5.8.0.Final, 5.9
    • Fix Version/s: 5.8.1.Final, 5.9
    • Component/s: elasticsearch
    • Labels:
      None

      Description

      GsonEntity is designed to behave well with "reactive" I/O, and as such it must handle overflow (too much input when the consumer is not ready). It currently does so by storing "overflowing writes" in char buffers.
      In order to avoid creating too many small buffers, we made sure to wrap the writer collecting input in a BufferedWriter, whose buffer size was set to 1024.
      So far, so good.
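      To make the buffering behaviour concrete, here is a minimal sketch of how a BufferedWriter with a 1024-char buffer hands its content to the underlying writer in fixed-size chunks, wherever the boundary happens to fall. This is not GsonEntity code: `BufferedWriterSplitDemo`, `ChunkRecorder`, `textWithPairAcrossBoundary` and `writeThroughBuffer` are hypothetical names used for illustration only.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class BufferedWriterSplitDemo {

    /** Hypothetical stand-in for the writer collecting input: records every chunk it receives. */
    static class ChunkRecorder extends Writer {
        final List<String> chunks = new ArrayList<>();

        @Override
        public void write(char[] cbuf, int off, int len) {
            chunks.add(new String(cbuf, off, len));
        }

        @Override
        public void flush() {
        }

        @Override
        public void close() {
        }
    }

    /** Builds (bufferSize - 1) ASCII chars followed by U+1F600, which Java stores as two chars. */
    static String textWithPairAcrossBoundary(int bufferSize) {
        char[] padding = new char[bufferSize - 1];
        Arrays.fill(padding, 'a');
        return new String(padding) + "\uD83D\uDE00";
    }

    /** Writes the text through a BufferedWriter and returns the chunks the wrapped writer saw. */
    static List<String> writeThroughBuffer(String text, int bufferSize) {
        ChunkRecorder recorder = new ChunkRecorder();
        try (BufferedWriter writer = new BufferedWriter(recorder, bufferSize)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return recorder.chunks;
    }

    public static void main(String[] args) {
        List<String> chunks = writeThroughBuffer(textWithPairAcrossBoundary(1024), 1024);
        for (String chunk : chunks) {
            char last = chunk.charAt(chunk.length() - 1);
            System.out.println("chunk of " + chunk.length() + " chars, ends with high surrogate: "
                    + Character.isHighSurrogate(last));
        }
    }
}
```

      The buffer is flushed purely on char count, so nothing prevents a chunk from ending in the middle of a two-char character.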
      But... It turns out that encoding arbitrarily-split chunks of characters does not work well. Specifically, when a Unicode character outside the Basic Multilingual Plane (4 bytes in UTF-8, represented in Java as a surrogate pair, i.e. two 16-bit chars) has its left (high surrogate) and right (low surrogate) chars written to different char buffers:

      • the left char at the end of the "left" buffer may be silently discarded by the encoder
      • the right char at the start of the "right" buffer may lead to a MalformedInputException, because the encoder keeps no state between buffers and does not remember the left char
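      The two failure modes above can be reproduced with the JDK's CharsetEncoder alone. The sketch below encodes each chunk with its own encoder and `endOfInput = false`, which is an assumption about how the chunks get encoded and not a copy of the Hibernate Search code. `SplitSurrogateEncoding`, `Outcome` and `encodeAlone` are hypothetical names.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

class SplitSurrogateEncoding {

    /** What the encoder did with one chunk encoded in isolation. */
    static final class Outcome {
        final CoderResult result;
        final int bytesWritten;
        final int charsLeft;

        Outcome(CoderResult result, int bytesWritten, int charsLeft) {
            this.result = result;
            this.bytesWritten = bytesWritten;
            this.charsLeft = charsLeft;
        }
    }

    /** Encodes a single chunk with a fresh encoder, i.e. with no memory of earlier chunks. */
    static Outcome encodeAlone(String chunk) {
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();
        CharBuffer in = CharBuffer.wrap(chunk);
        ByteBuffer out = ByteBuffer.allocate(chunk.length() * 3 + 4);
        // endOfInput = false: the encoder is told that more input may follow
        CoderResult result = encoder.encode(in, out, false);
        return new Outcome(result, out.position(), in.remaining());
    }

    public static void main(String[] args) {
        // "left" buffer: ends with the high surrogate of U+1F600
        Outcome left = encodeAlone("a\uD83D");
        System.out.println("left:  underflow=" + left.result.isUnderflow()
                + ", chars left unconsumed=" + left.charsLeft);

        // "right" buffer: starts with the matching low surrogate
        Outcome right = encodeAlone("\uDE00b");
        System.out.println("right: malformed=" + right.result.isMalformed());
    }
}
```

      In the "left" case the encoder reports a plain underflow and leaves the high surrogate unconsumed in the input buffer, so the char is lost if the caller drops that buffer. In the "right" case the result is a malformed-input error, which becomes a MalformedInputException once `CoderResult.throwException()` is called.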

      I experienced the issue first-hand while playing with a demo, so don't tell me it's a rare and insignificant occurrence.

      Test case and solution coming in a PR.
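      For illustration only, here is one possible way to make chunked encoding safe; it is an assumption and not necessarily the approach taken in the PR. The idea is to keep a single encoder across chunks and carry any char it left unconsumed (such as a trailing high surrogate) over to the next chunk. `CarryOverEncoder` and `encodeChunk` are hypothetical names.

```java
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

class CarryOverEncoder {

    private final CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();
    // Chars the encoder could not consume yet, e.g. a high surrogate waiting for its low surrogate
    private String carry = "";

    /** Encodes one chunk; pass last = true for the final chunk so dangling surrogates are reported. */
    byte[] encodeChunk(String chunk, boolean last) {
        CharBuffer in = CharBuffer.wrap(carry + chunk);
        int capacity = (int) Math.ceil(in.remaining() * encoder.maxBytesPerChar()) + 4;
        ByteBuffer out = ByteBuffer.allocate(capacity);
        try {
            CoderResult result = encoder.encode(in, out, last);
            if (result.isError()) {
                result.throwException();
            }
            if (last) {
                encoder.flush(out);
            }
        } catch (CharacterCodingException e) {
            throw new UncheckedIOException(e);
        }
        // Whatever was left unconsumed is prepended to the next chunk
        carry = in.toString();
        out.flip();
        byte[] bytes = new byte[out.remaining()];
        out.get(bytes);
        return bytes;
    }

    public static void main(String[] args) {
        CarryOverEncoder enc = new CarryOverEncoder();
        byte[] first = enc.encodeChunk("a\uD83D", false);
        byte[] second = enc.encodeChunk("\uDE00b", true);
        System.out.println("bytes per chunk: " + first.length + " + " + second.length);
    }
}
```

      The cost is one small string concatenation per chunk; an alternative would be to avoid splitting inside a surrogate pair in the first place when filling the char buffers.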

            People

            • Assignee: Yoann Rodière (yrodiere)
            • Reporter: Yoann Rodière (yrodiere)