Unicode handling in Python has always been a headache, both for those first experimenting with the language and for more experienced developers. A simple Google search for the words “Python” and “Unicode” brings up hundreds of articles about Python Unicode handling and as many or more StackOverflow questions about it. There is even a Unicode HOWTO section in the official Python documentation. This is just another Python Unicode pain story: not one of the most common Unicode problems, but one that to date has no satisfactory solution.
The issue arises when trying to upload multipart/form-data with Unicode filenames to some servers using the requests library. Although the files are uploaded, the server behaves as if no data was received. Needless to say, files with no Unicode characters in their names are uploaded just fine. To shed some light on what might be happening, we need to see whether there is any difference in how the forms are encoded in the request. Let’s first try to upload a file with a non-Unicode filename:
And this is the request generated by that post:
Let’s now try to upload a file with a Unicode filename:
And here is the generated request:
As you can see, the request formats are not exactly the same. The Content-Disposition header for the filename changed slightly, from “filename=” to “filename*=utf-8''”. So not only are the filenames encoded differently, but so is the way in which they are described in the request. Which, then, is the correct format? Let’s see the request generated by the Firefox web browser, which correctly uploads files to the server:
Even for Unicode filenames, Firefox seems to format the Content-Disposition headers as if they were non-Unicode filenames. Why, then, does the requests library change the request format? In reality, the request is not encoded by the requests library but by urllib3. In the requests documentation we can read the following:
Requests allows you to send organic, grass-fed HTTP/1.1 requests, without the need for manual labor. There’s no need to manually add query strings to your URLs, or to form-encode your POST data. Keep-alive and HTTP connection pooling are 100% automatic, thanks to urllib3.
Therefore a more accurate question is: why does the urllib3 library behave this way? Form field headers are encoded inside urllib3 by the function format_header_param defined in fields.py. In particular, Unicode filenames are encoded by the following piece of code:
From the above it is clear that the way in which urllib3 encodes Unicode filenames follows the RFC 2231 specification. According to Wikipedia:
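As a simplified sketch of that behaviour, and not the library's actual source, the logic can be approximated with the standard library's email.utils.encode_rfc2231. The helper name format_param_rfc2231 is hypothetical, and the rule for when to fall back to the plain quoted form is an assumption made for illustration.

```python
from email.utils import encode_rfc2231


def format_param_rfc2231(name: str, value: str) -> str:
    """Hypothetical helper: format a header parameter in the RFC 2231 style."""
    # Assumption: plain ASCII values without quotes, backslashes or line
    # breaks can use the simple quoted form.
    if value.isascii() and not any(ch in value for ch in '"\\\r\n'):
        return f'{name}="{value}"'
    # Anything else is percent-encoded as UTF-8 and flagged with an asterisk.
    return f"{name}*={encode_rfc2231(value, 'utf-8')}"


print(format_param_rfc2231("filename", "report.txt"))
print(format_param_rfc2231("filename", "ñandú.txt"))
```

The first call keeps the familiar filename="report.txt" form, while the second switches to the filename*=utf-8''... form with the non-ASCII bytes percent-encoded.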
Request for Comments documents were invented by Steve Crocker in 1969 to help record unofficial notes on the development of ARPANET. RFCs have since become official documents of Internet specifications, communications protocols, procedures, and events.
Let’s explore RFC 2231 then. It was published in November 1997 and defines MIME Parameter Value and Encoded Word Extensions: Character Sets, Languages, and Continuations. The Introduction section states that:
MIME is now widely deployed and is used by a variety of Internet protocols, including, of course, Internet email. However, MIME’s success has resulted in the need for additional mechanisms that were not provided in the original protocol specification.
Later on, Section 4 on Parameter Value Character Set and Language Information says:
Asterisks (“*”) are reused to provide the indicator that language and character set information is present and encoding is being used. A single quote (“‘”) is used to delimit the character set and language information at the beginning of the parameter value. Percent signs (“%”) are used as the encoding flag, which agrees with RFC 2047.
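To make the quoted rules concrete, here is a minimal sketch of how a receiver might unpack such a value. The helper name decode_extended_value is hypothetical, and the sketch assumes a well-formed value with both single quotes present. The sample string is the title parameter example given in RFC 2231 itself.

```python
from typing import Tuple
from urllib.parse import unquote


def decode_extended_value(raw: str) -> Tuple[str, str, str]:
    """Hypothetical helper: split charset'language'value and percent-decode it."""
    # The first two single quotes delimit the charset and the language tag;
    # the language part may be empty.
    charset, language, encoded = raw.split("'", 2)
    # Percent signs flag encoded bytes, interpreted in the declared charset.
    return charset, language, unquote(encoded, encoding=charset)


sample = "us-ascii'en-us'This%20is%20%2A%2A%2Afun%2A%2A%2A"
print(decode_extended_value(sample))
```

A server that does not implement this unpacking step sees only an unknown filename* parameter, which is consistent with the symptom described earlier where the upload appears to carry no file.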
So far so good. However, more than 21 years have passed since RFC 2231 was published. Is this still the correct way of encoding Unicode filenames in a form submission? The short answer is no. RFC 7578, published in July 2015 and currently adopted by the HTML 5 standard, defines Returning Values from Forms: multipart/form-data. Already in the abstract, it states that:
This specification defines the multipart/form-data media type, which can be used by a wide variety of applications and transported by a wide variety of protocols as a way of returning a set of values as the result of a user filling out a form. This document obsoletes RFC 2388.
RFC 2388, published in August 1998, was the one recommending the use of RFC 2231 to encode Unicode filenames in multipart/form-data submissions. Furthermore, RFC 7578 clearly says that:
The encoding method described in [RFC5987], which would add a “filename*” parameter to the Content-Disposition header field, MUST NOT be used.
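For contrast, here is a minimal sketch of the browser-style alternative, in which the filename is sent as raw UTF-8 inside the ordinary filename parameter. The helper name format_param_html5 is hypothetical, and the escaping of line feeds, carriage returns and double quotes follows my reading of the multipart/form-data encoding rules in the HTML standard, so treat it as an assumption rather than a complete implementation.

```python
def format_param_html5(name: str, value: str) -> str:
    """Hypothetical helper: keep the raw value, escaping only what breaks the header."""
    # Assumption: only LF, CR and the double quote need escaping here.
    for char, escaped in (("\n", "%0A"), ("\r", "%0D"), ('"', "%22")):
        value = value.replace(char, escaped)
    return f'{name}="{value}"'


header = 'Content-Disposition: form-data; name="file"; ' + format_param_html5(
    "filename", "ñandú.txt"
)
# The header is written to the request body as UTF-8 bytes, with no
# "filename*" parameter involved.
print(header.encode("utf-8"))
```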
Summing up, it is clear that urllib3 is using an obsolete standard to encode Unicode filenames in multipart/form-data submissions. RFC 5987 is the profile of the RFC 2231 encoding for HTTP header fields, so the “filename*” form that urllib3 produces is exactly what RFC 7578 forbids. Why, then, does urllib3 still use a standard made obsolete more than 3 years ago?
There is a considerable number of StackOverflow questions from people running into this problem, as well as urllib3 issues on GitHub, dating back more than 5 years. People have been debating this issue for what now seems an eternity. Reading through all the StackOverflow answers and GitHub issues, it seems that for a while it was not clear how to encode Unicode filenames in multipart/form-data forms. In 2014 the HTML 5 specification was still a draft, but it has now been fully adopted as the web standard. Among all the urllib3 GitHub issues, there is one that stands out over all the others: Issue #303 (https://github.com/urllib3/urllib3/issues/303), created on January 4, 2014. The issue remains open and its last post is from last year. Honestly, it seems like the urllib3 developers are showing some reluctance to abandon RFC 2231 and are not giving solid reasons for this. The staff member sigmavirus24 said in a post on this issue:
1. RFC2231 does work in the real world, but in spite of being a standard, it’s not implemented in every http server.
2. We have built into this library a whole separate library for dealing with multipart/form-data, instead any future improvements should be developed as an external library (in my opinion) and (maybe) vendored in.
3. Unfortunately, a quick search of pypi indicates no one has implemented this yet
4. Provided the fact that for probably 90% of our users the status quo just works, I’m not sure how urgent this issue is, regardless of its age
I really do not know why he said that, since it is already clear that RFC 2231 is not the current standard. Regarding point 4, I really cannot see how ignoring an issue faced by 10% of your users for such a long time is a correct approach. Points 2 and 3 seem a little more interesting, although I really do not understand why a new library has to be implemented while urllib3 continues to use an obsolete standard. Hopefully, this will be solved in the near future, although my hopes are not high. Lastly, there is an open pull request aiming to solve this problem, but it has not yet been merged (https://github.com/urllib3/urllib3/pull/1492). The last post on this pull request has some interesting thoughts:
I haven’t studied the RFC’s in depth but the discussion in #303 seems valid and it looks like this was first attempted in #304 – when RFC 7578 was still a draft but as of 2015 it has now rendered RFC2388 obsolete. If I’m not mistaken the usage of RFC2231 encoding was done in accordance with RFC2388.
If we aren’t worried about backwards compatibility then it would seem like going with the new standard method of multi-part encoding of filenames would make the most sense. From my reasoning it is unlikely that anyone will be adding support for a 22+ year old standard that has been rendered obsolete whereas I know that a number of people from various bug reports have had issues with the way UTF8 filenames are encoded.
I could understand if there were counter-examples where a server would only work with RFC2231 and not support the newer standard but moving forward I think it would make sense to support the newest standard. Like I said before I can’t totally vouch that this code does everything it needs to do at this point but it did solve the problem I ran into and so I’m willing to help try to hash it out so others can benefit from this change. I have contributed far less and have far less skin in the game than anyone else here at this point so my interest is just trying to figure out the best technical approach.
For me, the solution is to provide both options if backward compatibility is the problem, but in the end urllib3 will need to drop support for an old and obsolete standard entirely. But as I said, it has been 3 years already since RFC 2231 was made obsolete, and the urllib3 developers still seem to be advocating for its use. If you are facing this problem, I hope you do not feel as frustrated as I do. I have semi-good news for you: there is a workaround proposed in this blog: http://linuxonly.nl/docs/68/167…. OK, it is a hack, and if you are like me, it will not make you feel better, but for someone in need of a quick solution it could be a lifesaver.
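Separately from the linked hack, another possible stopgap is to build the multipart body by hand and pass it to requests as raw data, which bypasses the filename formatting altogether. The following is only a sketch under stated assumptions: build_multipart is a hypothetical helper, it handles a single file field, it does not escape the field name, and the upload URL in the comment is a placeholder.

```python
import uuid


def build_multipart(field, filename, content, content_type="application/octet-stream"):
    """Hypothetical helper: return (body, Content-Type header) for one file field."""
    boundary = uuid.uuid4().hex
    # Keep the filename as raw UTF-8; escape only quotes and line breaks.
    safe_name = filename.replace('"', "%22").replace("\r", "%0D").replace("\n", "%0A")
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{safe_name}"\r\n'
        f"Content-Type: {content_type}\r\n"
        "\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("ascii")
    return head + content + tail, f"multipart/form-data; boundary={boundary}"


body, header = build_multipart("file", "ñandú.txt", b"hello")
# With requests installed, the body could then be sent along these lines:
# requests.post("https://example.com/upload", data=body,
#               headers={"Content-Type": header})
```

The trade-off is that the caller takes over boundary generation and escaping, so this is best kept for the specific servers that reject the filename* form.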