So I was trying to parse this GFF file from MGI, with mouse gene annotations. And, well, I’m an idiot. But in a way that is potentially instructive.
The documentation for the file is a docx file (not really a recommended format for such metadata), but it seems rather simple, really: tab delimited, with 9 columns, the ninth column being a bunch of pasted attributes that need to be further parsed, but we’ll skip over that detail.
I’d wanted to use fread() from the data.table package, but it turns out that the file has a bunch of lines with “###” interspersed within the data, and I couldn’t see a way to skip over those in fread(), so I fell back to the usual base R function, read.table().
Let’s first download and unzip the file.
# download the file
site ", tmpfile))
remove_tmpfile

Okay, now to read it into R with read.table().
tab

This gives a warning message:

Warning message:
In scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
  EOF within quoted string

Hmm. What does that mean? Oh, no matter, let’s move on…
Wait, there are no genes on chromosomes 5, 8, 15, 18, Y, or MT. How could that be? Something must be wrong with the file. Let’s look at another file at the site, MGI.20160103.gff3.gz. That one’s missing chromosomes 8 and 13.
So I ask Dan Gatti: “Hey, those files are corrupted. Who should I talk to about them?”
And he’s like, “That’d be a disaster, but they look fine to me.”
So I tried using read.delim() and sure enough, no warning, genes on all chromosomes, and about twice as many records. Oops.
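A quick way to spot missing chromosomes is to tabulate the first column against the set you expect. This is a minimal sketch with made-up values; `chr` and `expected` are illustrative names, not objects from the post.

```r
# Made-up values standing in for the first (chromosome) column of the parsed file
chr <- c("1", "1", "2", "X", "3", "2")

# Chromosomes we expect to see (shortened list, for illustration only)
expected <- c(1:3, "X", "Y", "MT")

# Counts per expected chromosome; zeros flag chromosomes with no records
print(table(factor(chr, levels = expected)))

# Names of expected chromosomes that never appear
missing_chr <- setdiff(expected, chr)
print(missing_chr)
```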
read.delim() vs read.table()
So what’s the difference between read.delim() and read.table()? Well, read.delim() calls read.table() with a particular set of default values for the arguments:

> read.delim
function (file, header = TRUE, sep = "\t", quote = "\"", dec = ".",
    fill = TRUE, comment.char = "", ...)
read.table(file = file, header = header, sep = sep, quote = quote,
    dec = dec, fill = fill, comment.char = comment.char, ...)

The key argument here is quote, in that read.table() uses quote="\"'" (that is, looking for either single- or double-quotes) while read.delim() uses quote="\"" (just looking for double-quotes).
There are no double-quotes in the file, but the ninth column contains some single-quotes, and so my use of read.table() was mucking everything up. And presumably there was an odd number of them, so the end-of-file (EOF) character was inside one of those quoted strings.
To read the file properly, I should have used quote="\"". Even better, I could use quote="", and for that matter also fill=FALSE (since every line is supposed to contain all nine columns).
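One way to confirm the differing defaults without reading the function source is to pull them out with formals(). A minimal sketch, using only base R (the name `args_of_interest` is just for illustration):

```r
# Extract the default values of selected arguments from each function
args_of_interest <- c("quote", "comment.char")
defaults <- sapply(args_of_interest, function(arg) {
    c(read.table = formals(read.table)[[arg]],
      read.delim = formals(read.delim)[[arg]])
})

# Rows are the functions, columns are the arguments
print(defaults)
```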
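The effect is easy to reproduce on a toy file. The sketch below uses made-up rows (not taken from the MGI file) with a single apostrophe in the last column, and compares what read.table() and read.delim() return with their default quote settings.

```r
# Tiny tab-delimited file with one unmatched apostrophe (made-up data)
tmp <- tempfile(fileext = ".txt")
writeLines(c("1\tgeneA\tName=abc",
             "2\tgeneB\tName=5' thing",
             "3\tgeneC\tName=def",
             "4\tgeneD\tName=ghi"), tmp)

# read.table() treats the apostrophe as opening a quoted string;
# collect the warnings instead of letting them scroll by
msgs <- character(0)
bad <- withCallingHandlers(
    read.table(tmp, sep = "\t", header = FALSE),
    warning = function(w) {
        msgs <<- c(msgs, conditionMessage(w))
        invokeRestart("muffleWarning")
    })

# read.delim() only looks for double-quotes, so every line is read
good <- read.delim(tmp, header = FALSE)

print(msgs)
print(c(read.table = nrow(bad), read.delim = nrow(good)))
unlink(tmp)
```

Everything after the apostrophe, newlines included, gets swallowed into one field, which is why records go missing rather than triggering an error.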
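Here is a sketch of that safer call on a toy nine-column file. The rows are invented, `make_row` is a hypothetical helper, and relying on read.table()'s default comment.char="#" to drop the "###" lines is an assumption about how those lines would be skipped (it would also truncate any data line that contains a "#").

```r
# Hypothetical helper: build one made-up nine-column, tab-delimited record
make_row <- function(start, attr) {
    paste(c("chr1", "src", "gene", start, start + 10, ".", "+", ".", attr),
          collapse = "\t")
}

tmp <- tempfile(fileext = ".gff3")
writeLines(c("##gff-version 3",
             make_row(1, "Name=abc"),
             "###",
             make_row(2, "Name=5' thing"),
             "###",
             make_row(3, "Name=def")), tmp)

# quote = "" turns off quote handling entirely;
# fill = FALSE makes a line with too few fields an error rather than padding it
gff <- read.table(tmp, sep = "\t", header = FALSE, quote = "",
                  fill = FALSE, comment.char = "#",
                  stringsAsFactors = FALSE)

print(dim(gff))
print(gff[2, 9])
unlink(tmp)
```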
Lessons

There are a number of lessons here.
I shouldn’t have ignored the “EOF within quoted string” warning.
I should have compared the number of lines I read in with the number of lines in the input file. If I’d done so, I’d have seen that I had just about half as many lines as I should’ve, and so I’d clearly messed something up.
When I run into a problem like this, it’s much more likely that there’s a problem with my code than that there’s a problem with the file.
With a file of this sort, I should have used quote="" and fill=FALSE in my call to read.table(). I’m not expecting any quoted fields, and I’m expecting that every line will have exactly nine columns.
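The line-count check from the second lesson can be sketched as follows. `count_data_lines` is a hypothetical helper, and treating lines that start with "#" as non-data is an assumption that matches read.table()'s default comment handling for this kind of file.

```r
# Hypothetical helper: count non-empty lines that are not comment lines
count_data_lines <- function(path) {
    lines <- readLines(path)
    sum(nzchar(lines) & !grepl("^#", lines))
}

# Made-up file with one stray apostrophe
tmp <- tempfile(fileext = ".txt")
writeLines(c("a\tx", "b\tit's", "c\ty", "d\tz"), tmp)

n_lines <- count_data_lines(tmp)
tab <- suppressWarnings(read.table(tmp, sep = "\t", header = FALSE))

# A mismatch means the parse went wrong somewhere
if (nrow(tab) != n_lines) {
    message("Read ", nrow(tab), " rows but the file has ", n_lines,
            " data lines; check the quote and fill arguments.")
}
unlink(tmp)
```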