Ticket #41 (closed defect: fixed)

Opened 4 months ago

Last modified 5 days ago

Invalid byte 2 of 3-byte UTF-8 sequence.

Reported by: chyssler
Assigned to:
Priority: normal
Milestone: 1.2
Component: mercurialeclipse
Version:
Severity: normal
Keywords:
Cc:

Description

Using Hg r837 I got some problems with the SAX parser on Windows XP.

Got this when trying to show history

com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException: Invalid byte 2 of 3-byte UTF-8 sequence.
at com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.invalidByte(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.read(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.load(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.scanLiteral(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLScanner.scanAttributeValue(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.scanAttribute(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.scanStartElement(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)
at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(Unknown Source)
at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(Unknown Source)
at com.vectrace.MercurialEclipse.commands.AbstractParseChangesetClient.createMercurialRevisions(AbstractParseChangesetClient.java:440)
at com.vectrace.MercurialEclipse.commands.HgLogClient.getProjectLog(HgLogClient.java:126)
at com.vectrace.MercurialEclipse.history.MercurialHistory.refresh(MercurialHistory.java:191)
at com.vectrace.MercurialEclipse.history.MercurialHistoryPage$RefreshMercurialHistory.run(MercurialHistoryPage.java:112)
at org.eclipse.core.internal.jobs.Worker.run(Worker.java:55)

Attachments

charset_fix.diff (1.4 kB) - added by mknittig on 11/26/08 06:59:09.
Other patch contained garbage
test.xml (0.6 kB) - added by mknittig on 11/26/08 07:12:39.
Commit XML from the debugger

Change History

08/12/08 20:26:25 changed by bastiand

Hey Stefan, please provide more info *g*. What would be interesting (as you of course know) is the filename, description etc. of the changeset :-).

This seems to be due to the locale stuff that is especially painful when using XML parsers. I've already thought about writing a custom parser that just takes the output and doesn't bother about locales...

09/13/08 09:34:00 changed by bastiand

The only solution to this that I can think of is to create a dialog for choosing an encoding per hg root (e.g. UTF-8 or ISO-8859-15), which can then be used as the preamble of the XML document. Nothing else will work, as we don't know whether somebody used, e.g., a Cyrillic codeset when writing a changeset description. And Mercurial only saves the description as a byte array, so no help from that side...
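A minimal sketch of the preamble idea, not the code that ended up in MercurialEclipse: the raw bytes get an XML declaration naming the chosen encoding before they reach the SAX parser. The class name, the `log`/`cs` elements and the `desc` attribute are hypothetical, since the real changeset XML is not shown in this ticket.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class EncodingPreamble {

    /** Prepends an XML declaration naming the given encoding to the raw bytes. */
    static byte[] withPreamble(byte[] body, String encoding) {
        byte[] head = ("<?xml version=\"1.0\" encoding=\"" + encoding + "\"?>\n")
                .getBytes(StandardCharsets.US_ASCII);
        byte[] out = new byte[head.length + body.length];
        System.arraycopy(head, 0, out, 0, head.length);
        System.arraycopy(body, 0, out, head.length, body.length);
        return out;
    }

    /** Parses the bytes and returns the first "desc" attribute found (hypothetical name). */
    static String firstDesc(byte[] xml) {
        final String[] found = new String[1];
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                    new ByteArrayInputStream(xml), new DefaultHandler() {
                        @Override
                        public void startElement(String uri, String localName,
                                                 String qName, Attributes atts) {
                            if (found[0] == null) {
                                found[0] = atts.getValue("desc");
                            }
                        }
                    });
        } catch (ParserConfigurationException | SAXException | IOException e) {
            // Unchecked wrapper keeps the sketch short; real code would report the error.
            throw new IllegalStateException(e);
        }
        return found[0];
    }
}
```

With the encoding chosen per hg root, a call would look like `EncodingPreamble.firstDesc(EncodingPreamble.withPreamble(rawBytes, "ISO-8859-1"))`. Without the declaration, the parser has to guess and falls back to UTF-8, which is where the exception above comes from.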

(follow-up: ↓ 4 ) 11/04/08 22:12:45 changed by bastiand

Any news on this? Does it still occur? I submitted a changeset a few weeks ago that might have fixed it...

(in reply to: ↑ 3 ) 11/10/08 22:34:15 changed by jimrobi@cisco.com

Replying to bastiand:

Any news on this? Does it still occur? I submitted a changeset a few weeks ago that might have fixed it...

I just did a fresh install of Mercurial (1.0.2) and Mercurial Eclipse (1.1.867) on Windows XP and I am seeing this when doing a pull. Note that the pull seems to succeed nonetheless. I do not see it when trying to show history.

I do NOT see this on Linux - Mercurial 1.0.1 and Mercurial Eclipse (1.1.867).

11/11/08 06:47:13 changed by bastiand

The fix is only available in the current beta so far... But what does the changeset XML look like that makes Mercurial under Windows cry?

11/25/08 12:46:26 changed by mknittig

I'm using Mercurial 1.0.2 and MercurialEclipse 1.1.19 and the bug is still there... :-/

com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException: Invalid byte 2 of 3-byte UTF-8 sequence.
at com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.invalidByte(UTF8Reader.java:674)
at com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.read(UTF8Reader.java:398)
at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.load(XMLEntityScanner.java:1742)
at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.peekChar(XMLEntityScanner.java:487)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2679)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:648)
at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:140)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:510)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:807)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:737)
at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:107)
at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1205)
at com.vectrace.MercurialEclipse.commands.AbstractParseChangesetClient.createMercurialRevisions(AbstractParseChangesetClient.java:447)
at com.vectrace.MercurialEclipse.commands.HgLogClient.getProjectLog(HgLogClient.java:129)
at com.vectrace.MercurialEclipse.history.MercurialHistory.refresh(MercurialHistory.java:179)
at com.vectrace.MercurialEclipse.history.MercurialHistoryPage$RefreshMercurialHistory.run(MercurialHistoryPage.java:107)
at org.eclipse.core.internal.jobs.Worker.run(Worker.java:55)

11/25/08 18:08:19 changed by mknittig

Seems to be a problem with non-ASCII charsets. Google says the problem occurs when you try to parse a UTF-8 document that contains ISO-8859-1 characters. I don't really understand the problem: when my commit message in Mercurial contains non-ASCII chars, shouldn't they be UTF-8 encoded?
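A small illustration of the likely mechanism, assuming a commit message with an umlaut (the sample text is made up): in ISO-8859-1 and CP1252 the letter "ä" is the single byte 0xE4, while in UTF-8 any byte in the range 0xE0 to 0xEF announces a three-byte sequence. A strict UTF-8 decoder therefore rejects the ASCII byte that follows, which matches the "Invalid byte 2 of 3-byte UTF-8 sequence" wording.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class Latin1AsUtf8 {

    /** Returns true if the bytes decode cleanly as UTF-8, false on malformed input. */
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String message = "f\u00E4llt"; // hypothetical commit text containing "ä"
        byte[] latin1 = message.getBytes(StandardCharsets.ISO_8859_1); // 66 E4 6C 6C 74
        byte[] utf8 = message.getBytes(StandardCharsets.UTF_8);        // 66 C3 A4 6C 6C 74
        System.out.println(isValidUtf8(latin1)); // prints "false"
        System.out.println(isValidUtf8(utf8));   // prints "true"
    }
}
```

So the text itself is fine; what breaks is reading bytes written in the Windows codepage as if they were UTF-8.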

11/25/08 23:50:40 changed by mknittig

This patch solves the problem for me. Mercurial uses UTF-8 by default and so do all modern Linux distributions. But Windows XP doesn't...

11/26/08 06:53:59 changed by bastiand

Uh...the patch only contains an import. I can't imagine that an import solves the problem. I still haven't been able to reproduce it. Could you please paste a changeset summary that triggers this bug, together with the local codepage and the repository codepage you used? I assume this can maybe be fixed by using the HGENCODING variable.
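For reference, a rough sketch of what setting HGENCODING from Java could look like. `HGENCODING` is a real Mercurial environment variable that overrides the encoding Mercurial detects from the locale; the helper class and method names are hypothetical, and this is not taken from the MercurialEclipse sources.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class HgEnvironment {

    /** Builds (but does not start) an hg invocation with HGENCODING forced to UTF-8. */
    static ProcessBuilder hgCommand(String... args) {
        List<String> command = new ArrayList<>();
        command.add("hg"); // assumes hg is on the PATH
        command.addAll(Arrays.asList(args));
        ProcessBuilder builder = new ProcessBuilder(command);
        // Ask Mercurial to emit text as UTF-8 regardless of the system codepage.
        builder.environment().put("HGENCODING", "UTF-8");
        return builder;
    }
}
```

A caller would run `HgEnvironment.hgCommand("log").start()` and then read the process output as UTF-8, so the bytes and the parser agree on the encoding.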

11/26/08 06:59:09 changed by mknittig

  • attachment charset_fix.diff added.

Other patch contained garbage

11/26/08 07:12:39 changed by mknittig

  • attachment test.xml added.

Commit XML from the debugger

11/26/08 07:15:09 changed by mknittig

The command line says CP437, but the system probably uses CP1252 (the default, I didn't change anything). I didn't change anything in the Mercurial settings either, so I assume that Mercurial uses the default (UTF-8).

11/26/08 07:20:54 changed by bastiand

What's the command line output?

11/26/08 11:29:58 changed by anonymous

In Eclipse? The Changeset XML in the Mercurial Log and the Exception in the Eclipse Error Log...
I don't think HGENCODING will help here, because it changes the encoding of Mercurial and not of the system. It would only help if Mercurial converted its internal UTF-8 changesets into changesets that use the system encoding, and vice versa (the repositories I push to and pull from need UTF-8 changesets)...
The patch I submitted should definitely solve the problem. I just tested it on another system...

11/26/08 11:38:02 changed by mknittig

Oops, forgot to log in. BTW: how can I add myself to the CC list?

11/26/08 18:40:01 changed by bastiand

Yeah, the login issue... I always forget it myself as well... You can either create an Assembla account and log in with it, or add your e-mail as CC. That should normally work. If you have an Assembla account, you'll receive notifications when somebody adds to a ticket that you've responded to.

I'll test your page as soon as I get Mercurial working on my computer. It seems a bit more complicated to run the current crew version on MacOS than on Ubuntu...

11/26/08 19:07:19 changed by bastiand

Btw., page = patch.

11/27/08 06:32:39 changed by bastiand

  • status changed from new to closed.
  • resolution set to fixed.

A slightly modified form of your patch is now in mainline. Thanks for providing it! Btw., the patch file failed to apply on my repository - did you use an old MercurialEclipse version as base?

