Java InputStream glitches on Android

On 12 February 2016, beginning from around 1440, there were reports that the Headuck Call Blocker App (hereafter “the App”) failed to download the updated database from HKJunkCall.com (released at around 1340), while other apps could do so normally. Having just upgraded the App to use https (TLS) instead of http for downloading, and with some security features to prevent communication over wiretapped channels, the new code came into mind as a primary suspect for the failure.

It turned out that the problem was instead in the seemingly benign piping of simple byte stream between different standard libraries in Android.

The tale of two libraries

Some basics – the InputStream class of Java is a base class representing an input stream of bytes. A number of libraries in Java supports / builds around the InputStream class, making it a common “protocol” for chaining byte processing in Java.

The App downloads a frequently updated list of junk call numbers, provided by HKJunkCall.com in a compressed (zipped) XML format. To get the contents, it needs to decompress (unzip) the downloaded file and then parse the XML. Instead of downloading and storing the zip file on the file system, extracting it as an XML file, and then feeding the file to an XML parser, the App uses the approach of streaming the downloaded data through the necessary processing steps, which avoids the need to manage intermediate files or hold large buffers.

In Android, the standard library for decompressing a file in the zip format is java.util.zip.ZipInputStream. It inherits from InputStream, takes a compressed stream as input, and provides the decompressed file as an InputStream.
For XML parsing using an event model, the standard library is javax.xml.parsers.SAXParser. It returns an SAX interface to the XML parser (org.apache.harmony.xml.ExpatParser in this case), and provides methods to parse files from an XML input source / stream.

Naturally, it should be simple and straightforward to feed the downloaded zip data into the ZipInputStream class, and then feed that stream into the XML Parser. The XML parser should then happily parse the XML contents extracted from the zip file (provided of course that the XML is valid and with the right encoding).

This was implemented in the App, and indeed worked MOST of the time (ever since, and even before, the release of the App), until things go wrong.

The contract of read()

The problem lies in fine details of how the seemingly simple method, read(byte[] b) (and its cousins) of InputStream, is implemented by its sub-classes, notably ZipInputStream.

The read method is a simple and standard one. It is used to read a byte stream into the provided buffer, b. According to the official documentation of the base class, it –

Reads some number of bytes from the input stream and stores them into the buffer array b. The number of bytes actually read is returned as an integer. This method blocks until input data is available, end of file is detected, or an exception is thrown.

If the length of b is zero, then no bytes are read and 0 is returned; otherwise, there is an attempt to read at least one byte. If no byte is available because the stream is at the end of the file, the value -1 is returned; otherwise, at least one byte is read and stored into b.

From the description, it is clear that, unless the length of the buffer array is 0, read(byte[]) should return either -1 (indicating the end of input), or the number of bytes read which should be >= 1, and never 0. If no byte is read but the end of stream has not been reached, the method should block, instead of returning 0.

The culprit?

Unfortunately, the ZipInputStream in Android does not seem to follow the semantics strictly. Here is an extract of its read method (Copyright 2005-2008 The Android Open Source Project under Apache 2.0)

if (inf.needsInput()) {
    fill();
    if (len > 0) {
        entryIn += len;
    }
}
int read;
try {
    read = inf.inflate(buffer, start, length);
} catch (DataFormatException e) {
    throw new ZipException(e.getMessage());
}
if (read == 0 && inf.finished()) {
    return -1;
}
crc.update(buffer, start, read);
return read;

inf is a variable of the Inflater class, a wrapper of a native function to decompress the zip file. From the extract alone, it seems probable that, when read == 0 is returned from inf.inflate(), inf.finished() could possibly be false, and the return could be 0.

This indeed happened – the zip file for the update on 12 February, when fetched via https and fed into the ZipInputStream class as input, and in certain versions of Android (2.2 and 4.4, but not 5.0), caused 0 to be returned by the read function in the middle of decompression of the XML stream, well before the end of the stream was reached.

This might be a problem originating from the ZipInputStream class itself, or was due to the behaviour of the underlying input stream (e.g. output from decryption of https stream), with ZipInputStream propagating the (unexpected) problem upwards without dealing with it. I did not further delve into the source to see why, since I had to handle this output either way.

Deviation from the semantics might not be a problem in itself, if the consumer of the stream were tolerant enough. In fact, in a number of common implementations for consumers of input streams, returns with zero length from read are effectively ignored (for example, by appending nothing to a buffer, and read again). This seems to be the case too for the XML parser in ExpatParser.java which the ZipInputStream in the present case is ultimately fed into: (Copyright (C) 2007 The Android Open Source Project, Apache 2.0)

/**
 * Parses XML from the given input stream.
 */
private void parseFragment(InputStream in)
        throws IOException, SAXException {
    byte[] buffer = new byte[BUFFER_SIZE];
    int length;
    while ((length = in.read(buffer)) != -1) {
        try {
            appendBytes(this.pointer, buffer, 0, length);
        } catch (ExpatException e) {
            throw new ParseException(e.getMessage(), this.locator);
        }
    }
}

However, unfortunately for this case, the input stream was fed into the parser in a slight convoluted manner. A standard Java class java.io.InputStreamReader was used first to wrap the ZipInputStream, and the reader itself was wrapped in an InputSource class, which also encapsulated other information such as encoding, before feeding to the SAX parser interface. Thus, the stream passed through InputStreamReader‘s read() method. Given the above semantics of InputStream‘s read, this intervening method turned the 0 return of ZipInputStream‘s read into -1 (which signals the end of the stream), before feeding it into the parser.

The behaviour is evident just by looking at the return statement of InputStreamReader‘s read method, used when the buffer length is greater than 0 (note that this code will never return 0):

return out.position() - offset == 0 ? -1 : out.position() - offset;

As a result, to the XML parser, the input stream would appear to have reached its end at the juncture when ZipInputStream‘s read emitted a 0 return, (less than) half way in parsing the XML file in this instance. It happened that the truncation occurred just after starting an XML tag, and the following ParseException was thrown by the parser.

org.apache.harmony.xml.ExpatParser$ParseException: At line X, column Y: no element found

It was somewhat lucky that, in this case, the behaviour was reproducible in reruns (perhaps due to the same fragmentation of the input zip stream after decryption operation, enabling the 0 return of the ZipInputStream to be reproduced). This allowed tracking down of the error.

The cure

Urgent fix

Before this was completely tracked down, I found that the problem could be worked around by simply buffering the whole decompressed XML file in a BufferedInputStream (i.e. by calling mark() with a limit larger than the decompressed file size at the beginning, reading the whole stream until end, and then reset() it, before feeding the stream to the parser). In retrospect, this was because the buffering of the ZipInputStream by BufferedInputStream (which did not involve the InputStreamReader in-between) was tolerant of 0-byte reads, and the subsequent re-play of the stream to feed through InputStreamReader into the parser no longer produced 0-byte reads.

BufferedReader reader1 = new BufferedReader(reader);
reader1.mark(3000000);
while ((line = reader1.readLine ()) != null) {
}
reader1.reset();
xmlReader.parse(new InputSource(reader1));

This ugly approach (hardcoded size limit and bad efficiency) was used as a stop-gap patch to fix the problem, and was release at around 1750, 3 hours after the initial report, in the urgent bug-fix version (0.2.2) of the App.

Ultimate fix

Knowing deeper about the problem, it was fixed by something similar to subclassing the ZipInputStream class and overriding its read methods, e.g.

@Override
public final int read(byte[] b) {
    int ret;
    do {
        ret = super.read(b);
    } while ((ret == 0) && (b.length > 0));
    return ret;
}

This rectifies the semantics of ZipInputStream ‘s read function, i.e. that read() would not return 0, unless the buffer is of zero length, allowing the InputStream to be handled by other routine Java libraries without problems. (This is not the actual fix used, which involves modifying an intervening thin layer of FilterInputStream to the same effect.)

Another fix, which could also avoid the problem when used alone, is to bypass the InputStreamReader / InputSource classes in between, and directly feed the byte stream to the parser, i.e. instead of

XMLReader xmlReader = saxParser.getXMLReader();
xmlReader.setContentHandler(handler);
xmlReader.parse(new InputSource(new InputStreamReader(is, "UTF-8")))

use

saxParser.parse(is, handler)

and let the parser deal with the encoding (good luck). This should obviously be more efficient too. So both fixes are applied in the new version.

The takeaway

The problem is not unique to ZipInputStream (or perhaps its underlying input stream). It seems that a number of other standard libraries in Java is implemented inconsistently in this aspect. See the answers here. So if one has to pass an (unknown) input stream to a reader class or other consumers which are not tolerant of zero-length reads, either make sure that the read method of the input stream would not return 0 for non-zero-length buffers (which would mean studying its source in most cases, unless explicitly documented), or subclass it similarly to make sure that the semantics are followed.