I happen to have a corpus which includes pretty much every word ever written in a book, including many misspelled, mistranscribed, or otherwise non-dictionary words.
After eliminating nonsense, non-English, or other mistakes, I think the real winner, coming in at 12 characters, is:
teetertotter
That's a relatively common word. Even though it's usually seen hyphenated, the unhyphenated form is recognized by all the online dictionaries I found.
----
And some other candidates, just for fun, in the 13 or 12 character range:
"proproprietor" seems more like a misspelling. Should have a hyphen, or be two words.
"priorityqueue" is of course familiar to hackers here, but is more of a jargon term, and is only concatenated due to appearing in source code. Invariably it's two words when actually written out.
"preprototype" is used exactly as is, in lots of scientific papers, up to the current day. That's a pretty good one too, and could be a tie for "teetertotter", but it's verging on jargon.
How did you scrape that data? How do you store and retrieve it? Is it just a standard db or a vector db?
Sorry for the questions, but it seems like an interesting, yet probably common data set and as someone who is venturing down this path, I’d like to learn more about building my own dataset similar to this from scratch.
Works for Ubuntu, too. My Colemak self can only get fluffy (6) from the front row, that's the longest word. Middle row really shines though, I can get hardheartedness (15) or assassinations (14).
8 flagfall "Flagfall, or flag fall, is a common Australian expression for a fixed start fee, especially in the taxi, haulage, railway, and toll road industries."
8 galagala "A name in the Philippine Islands of Dammara Philippinensis, a coniferous tree yielding dammar-resin."
Lower/Third Row:
- None
There are no vowels on the bottom row. So no words. I've been typing at ~ 50wpm for 30 years, and I don't think I'd ever actually consciously recognized this fact about the bottom row.
Dyalog APL, using the enable1 wordlist, I don't know its origins but you can get it from Peter Norvig's website https://norvig.com/ngrams/enable1.txt or various GitHubs and Gists:
Reading from the right, "test each word by removing 'qwertyuiop' and see if it leaves an empty string, use the test results to filter the input word list, descending-sort the length of each word and use that to arrange(index) the remaining words, flatten the array and take the top 7".
(Longest from the middle row is 'haggadahs' then 'alfalfas', third row is 'mm')
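For anyone without APL handy, the steps described above can be sketched in Python. This is only a rough equivalent of that description, not the original expression: the function names and the tiny sample list are made up for illustration, and in practice you would feed it the lines of enable1.txt.

```python
def only_uses(word: str, keys: str) -> bool:
    """True if deleting every character in `keys` leaves an empty string."""
    return word.translate(str.maketrans("", "", keys)) == ""


def longest_from_row(words, keys="qwertyuiop", top=7):
    """Filter to words typed entirely with `keys`, longest first, keep `top`."""
    matches = [w for w in words if w and only_uses(w.lower(), keys)]
    # sorted() is stable, so equal-length words keep their input order.
    return sorted(matches, key=len, reverse=True)[:top]


# Hypothetical sample standing in for the real word list.
sample = ["typewriter", "proprietor", "alfalfas", "pepper",
          "rupturewort", "mm", "keyboard"]

print(longest_from_row(sample, top=3))
print(longest_from_row(sample, keys="asdfghjkl"))
print(longest_from_row(sample, keys="zxcvbnm"))
```

Swapping the `keys` argument gives the middle-row and bottom-row searches without touching the rest.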
So it seems that in addition to having parts of its kernel based on FreeBSD, there are also a lot of similarities in the wordlist at /usr/share/dict/words of macOS to that of FreeBSD :) perhaps even the same?
C:\>grep '^[qwertyuiop]*$' /usr/share/dict/words | awk '{print length, $0}' | sort -rn | head
Bad command or file name
Bad command or file name
SORT: Too many parameters
Bad command or file name
MS-DOS 6.22 (excluding any typos as I rewrote it, I only did a proof of concept with a few words but it seemed to work).
@echo off
mkdir c:\t
echo prompt echo %1 $g C:\t\%1 >> c:\temp.bat
command /c temp.bat %1 > c:\h.bat
del c:\temp.bat
type words.txt | find /V "a" | find /V "s" | find /V "d" | find /V "f" | find /V "g" | find /V "h" | find /V "j" | find /V "k" | find /V "l" | find /V "z" | find /V "x" | find /V "c" | find /V "v" | find /V "b" | find /V "n" | find /V "m" > c:\words2.txt
echo > h.bas DIM word as STRING
echo >> h.bas OPEN "C:\words2.txt" for INPUT as #1
echo >> h.bas DO WHILE NOT EOF(1)
echo >> h.bas INPUT #1, word
echo >> h.bas CMD$ = "C:\h.bat " + word
echo >> h.bas SHELL CMD$
echo >> h.bas LOOP
echo >> h.bas CLOSE #1
echo >> h.bas SYSTEM
qbasic /RUN c:\h.bas
dir /OS /B c:\t
@del c:\t\*
@rmdir c:\t
@del c:\words2.txt
@del c:\h.bat
@del c:\h.bas
You can play with MS-DOS 6.22 in a virtual-machine-in-browser here[1]. That VM comes with Vim (non-standard) so use Vim or Edit to create a word list and save as c:\words.txt. Then type all this code into a batch file using `edit run.bat` and then run it with `run %1`. MS-DOS 6.22 came with QBASIC so I think that's allowed; I tried to avoid it but wasn't able to. NB. DOS is way less capable than Windows cmd prompt so there's no `for /f` or anything. "Dir /OS /B" sorts files by size and that view will leave the largest files on screen as the answer. The files will be one per word, containing the word so the size in bytes is the word length and the filename is the word to see it in the file listing. The words will be echoed into the files by a helper batch file containing `echo %1 > %1`. Building the helper batch file is hard because echo cannot echo > into a file. The qwerty filtering is a chain of `find /V "a"` for excluding each of the other rows cough. I then couldn't loop over the file lines without QBASIC.
If you never used MS-DOS classic, try "edit test.txt" and see how it has a nice TUI, where Alt+F brings up the File menu, the brightly coloured letters are the hotkeys, so Alt+F, X will quit. Shift+Down will select a line, Shift+Delete to cut and Shift+Insert to paste. Ctrl+left/right arrows to jump forward/back a word, Ctrl+Shift+Left/Right to select a word. 29 years later those keyboard patterns still work in this Firefox editor, in current notepad, WordPad and Word, and in my muscle memory. Escape tends to exit back out of popups and menus. Quit and try "help date" and see the TUI help, where the green angle brackets are hyperlinks and can be TAB'ed between, Enter to activate and Escape back. F1 is still the help key, only it actually showed offline help back then instead of doing a Bing search for 'get help in notepad'. Quit and run QBasic, see how F5 runs the code.
Here's something you may not know: the *-insane dictionaries, which are giant, are products of OCR output and are known to contain lots of errors.
I found a few earlier this year and was going to file a bug, so I did some research and found out this is known and expected behavior.
If the computer, say, reads "stubborn" as "stubbum", the smaller dictionaries are the ones that have been cross-checked and had those filtered out. The insane ones have not. It's a good name: "lack of sanity checks".
Here's an example word I found, "suabilities". You'll find it only on wordlist sites that used this wordlist and I guess, now here.
just saw this. I've got no idea how kanji ocr works but I do know enough japanese to know what most of those characters are attempting to refer to, my penmanship has certainly been that bad. I still don't understand how it would make its way into the standard unless that part wasn't written by someone who is competent in japanese.
I wonder how often that happens - surely there's tons of people dealing with japanese text who can't read it and just use diligence to make sure the "letters are the same"
I've used the insane dictionaries a number of times for puzzle stuff and I never knew that they were derived from OCR output. Thanks for mentioning that!
You might find the... 'translation'[1] of Genesis 1 using only keys on the Colemak home row interesting:
In the start The One has risen the stars and the earth.
The earth had no order, and nothin' resided there; and shade resided on the nonendin' 'neath. And The One rided on the seas.
Then The One said: "I desire it to shine"; and it shone.
And The One had seen the shine, that it's neat; and The One sorted the shine on one side, and the shade on the other.
The One then denoted the shine and the shade. So the nite and the shine that are date no. one had ended.
I tried to do some other fun things like going row by row with each row only contributing one letter and seeing what’s the longest word I could come up with.
If I start at the top row and go down, I can make TAXES but couldn’t think of a longer word. The third row having no vowels makes it so hard.
Starting at the bottom row and going up, I came up with CHICKEN which is delicious and neat that it ends where it started. Chickens is longer but ends on the middle row which is not as neat I feel like :(
> If I start at the top row and go down, I can make TAXES but couldn’t think of a longer word. The third row having no vowels makes it so hard.
A dictionary search turned up "paxwaxes" as the longest word I could find that starts in the top row and goes down, wrapping around to the top every three letters.
> Starting at the bottom row and going up, I came up with CHICKEN which is delicious and neat that it ends where it started. Chickens is longer but ends on the middle row which is not as neat I feel like :(
Chickens is indeed the longest.
If you start at the bottom row and go up-and-down: cataclysms, or catamarans.
If you start at the top row and go down-and-up: escapable.
If you start in the middle and go down-and-up: scarabaean
If you start in the middle and go up-and-down, I didn't find anything longer than 7 letters, and there were 39 seven-letter words, including "discard", "grandpa", and "stacked".
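A rough Python sketch of how searches like these could be run, using the words quoted in this thread as checks. The row patterns and function names are my own invention, and the layout is assumed to be plain US QWERTY; point `longest` at whatever word list you trust.

```python
from itertools import cycle

ROWS = {"top": "qwertyuiop", "middle": "asdfghjkl", "bottom": "zxcvbnm"}
ROW_OF = {ch: name for name, keys in ROWS.items() for ch in keys}

# Repeating row sequences for each variant of the game.
DOWN_WRAP = ("top", "middle", "bottom")             # taxes, paxwaxes
UP_WRAP = ("bottom", "middle", "top")               # chicken, chickens
BOTTOM_UP_DOWN = ("bottom", "middle", "top", "middle")
TOP_DOWN_UP = ("top", "middle", "bottom", "middle")
MIDDLE_DOWN_UP = ("middle", "bottom", "middle", "top")
MIDDLE_UP_DOWN = ("middle", "top", "middle", "bottom")


def follows(word: str, pattern) -> bool:
    """True if letter i of `word` sits on row i of the repeating `pattern`."""
    return all(ROW_OF.get(ch) == row
               for ch, row in zip(word.lower(), cycle(pattern)))


def longest(words, pattern):
    """All words of maximal length that follow `pattern`."""
    matches = [w for w in words if w and follows(w, pattern)]
    best = max(map(len, matches), default=0)
    return [w for w in matches if len(w) == best]


words = ["taxes", "paxwaxes", "chicken", "chickens", "cataclysms",
         "escapable", "scarabaean", "discard", "keyboard"]
print(longest(words, DOWN_WRAP))
print(longest(words, UP_WRAP))
```

Keeping each variant as a tuple of row names means a new rule (say, skipping the middle row) is just another tuple.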
Related, is there a high quality plaintext dictionary file for running similar searches? I’ve spent several hours but couldn’t find one that’s both comprehensive and accurate.
What are your rules for what counts as a "word"? If you go with the basic scrabble rules (i.e. nothing that would be capitalized or punctuated) then YAWL[1] is pretty good, with the downside being the most recent version I know of is from 2008.
FYI, rupturewort is the sole 11-letter word answer to TFA in YAWL; found using:
Some common Linux distributions have packages that provide word list files to /usr/share/dict/ in several languages. It's likely for English files to be preinstalled. I've had plenty of fun practising regex and pipes with these word lists!
There's four: a hamlet named Treopert, a mountain range in Snowdonia named Eryri, a house in Bangor University named after that mountain range, and TUI outdoor shop.
Curiously the longest top-row places I can find anywhere on OpenStreetMap are almost all roads in France:
Rue Pierre Ropert
Rue Pierre Riquet (x2)
Rue Pierre Poutot
Rue Pierre Potier (x4)
Rue Pierre Perret (x3)
Rue Pierre Pietri
Route Petit Peyre (x2)
Poirier Pierrotte
Petite Rue Pierre
Rue Pierre Perrier (x3)
Rue Pierre Pottier (x3)
Rue Pierre Routier
Rue Poirier Piquet
Rue Pouyer Quertier (x2)
Route Pourpre et Or
Tyrepower Port Pirie <-- a shop in Australia
Petite Route Petite Rue
Germans would type on a QWERTZ keyboard (Z and Y are swapped). In theory this could open up the space of possible top-row-only German words considerably, as Z is very common in German (especially after T).
> I imagine German has some epic words that can be written in just the first row.
My understanding of German grammar is that words can be of infinite length since they allow unlimited compounding. But they also have a language authority that makes words official, and the current longest is 68 letters.
So it would indeed be an interesting exercise in German.
What is the name of this language authority? I researched and found "Council for German Orthography" and "Gesellschaft für deutsche Sprache" (Association for the German Language).
Duden (the standard German dictionary) is the closest I know. And they list "Aufmerksamkeitsdefizit-Hyperaktivitätsstörung" (the German word for attention deficit hyperactivity disorder) with 44 letters as the longest in the dictionary [1].
*Update:* There are also [2] and [3], but neither is still part of a law (the "-gesetz" suffix in the word) or regulation (the "-verordnung" suffix in the word), respectively.
A number of years ago I solved a minor but repetitive QoL problem I had, and created a password I could type with just my left hand. It started as 8 characters, but I now have variants with as many as 15 characters. Not a word, or even words strung together, but it is so nice being able to just type it with one hand.
This post finally got me to dig back up the ultimate word trivia website, valiantly hosted on Tripod and still maintained: https://jeff560.tripod.com/words1.html
Most dictionaries list only the standard form PROPRIETARY, so it is arguable whether a word list should contain "PROPRIETORY": although Merriam-Webster and Wiktionary (unlike most dictionaries) list it as an alternative spelling, and it does occur a few times in the wild (e.g. see https://books.google.com/ngrams/graph?content=proprietory%2C... for a comparison), it is not surprising that most word lists leave it out. (Or, even if it occurs in a word list, Stephens may have excluded it, and there is sufficient justification for doing so.)
I found myself mildly annoyed that the author calls QWERTY the “first row” of letters not the “top row”. If there was a “first”, I might nominate ZXCVBNM. Is QWERTY commonly known as the “first row”?
If you are interested in full sentences there's a "What if?" [1] that explains how to generate sentences using a single row of your keyboard (with a link to code [2]) or stranger stuff like "We reserved seats at a secret Starcraft fest".
for this kind of thing, aside from of course /usr/share/dict/words or /usr/share/dict/spanish, i commonly use a word list sorted by occurrences in the british national corpus which i keep at http://canonical.org/~kragen/sw/wordlist
this allows you to, among other things, tune the comprehensiveness/accuracy tradeoff to your liking for a particular task by cutting the list off at a given point
probably i should download the google 1-grams now that i have a bigger disk
possibly a more practical problem is, what are the most common words you can type entirely with the left hand while your other hand is on the mouse†; 'redraw' was a significant one with early versions of autocad
$ grep ' [qwertasdfgzxcvb]*$' ~/wordlist | head
2150885 a
923975 was
664780 be
478178 at
470949 are
that's a bit better. how about words you can type alternating the two hands, so you can type faster
$ egrep ' [^qwertasdfgzxcvb]?([qwertasdfgzxcvb][^qwertasdfgzxcvb])*[qwertasdfgzxcvb]?$' ~/wordlist | perl -ane 'print "$F[1] " if length $F[1] > 6' | fmt | head -4
problem problems england chairman alright element ancient visible penalty
quantity visitor signals amendment claudia bicycle authentic antibody
malaysia naughty dickens entitlement antique paisley rituals auditor
endowment blanche chairmen siemens chaotic suspend uruguay mcleish
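For comparison, here is a rough Python sketch of the same two filters (left hand only, and strictly alternating hands). It assumes the "COUNT WORD" line layout seen in the grep output above and treats every character outside the left-hand set as right-hand, as the regex does. The function names are mine, and the last three sample lines use made-up placeholder counts.

```python
LEFT_HAND = set("qwertasdfgzxcvb")


def parse_line(line: str):
    """Split 'COUNT WORD' at the last space."""
    count, _, word = line.strip().rpartition(" ")
    return int(count), word


def left_hand_only(word: str) -> bool:
    return all(ch in LEFT_HAND for ch in word)


def alternates_hands(word: str) -> bool:
    """True if no two neighbouring letters are typed by the same hand."""
    sides = [ch in LEFT_HAND for ch in word]
    return all(a != b for a, b in zip(sides, sides[1:]))


# First five lines come from the grep output above; the last three
# counts are placeholders, NOT real corpus frequencies.
sample = [
    "2150885 a", "923975 was", "664780 be", "478178 at", "470949 are",
    "3 problem", "2 chairman", "1 keyboard",
]
entries = [parse_line(line) for line in sample]

print([w for _, w in entries if left_hand_only(w)])
print([w for _, w in entries if len(w) > 6 and alternates_hands(w)])
```

Because the list is already sorted by frequency, filtering in file order keeps the most common matches first, which is what `head` relies on in the shell version.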
it's a curious experience to touch-type a sequence of these words because after a while you notice that something is unusual. the one-handed words are a bit more conspicuous. try typing 'a better career award as edward created database facts after we agreed we stared at dear steve' into the comment box, it's super weird
of course the bnc has some built-in biases which dramatically understate the frequency of certain words
(defconstant *wordlist*
  (with-open-file (in #P"~/wordlist")
    (loop for line = (read-line in nil)
          while line
          for p = (position #\space line :from-end t)
          collect (list (parse-integer (subseq line 0 p))
                        (subseq line (1+ p))))))
(defconstant *topwords*
  (let ((w (loop for (freq word) in *wordlist*
                 if (loop for c across word always (find c "qwertyuiop"))
                   collect word)))
    (sort w #'< :key #'length)))
those last four lines of code seem very acceptable to me but not really competitive with the unix approach for interactive experimentation
I wonder how fast you could get at typing everything onehanded with some modifier key that mirrors the keyboard. Right hand is probably more useful there.
Douglas Engelbart's invention of the mouse, and the Mother of All Demos in 1968, had a mouse for the right hand and a chording keyboard for the left hand. Seen here: https://youtu.be/UhpTiWyVa6k?t=1949
(It's still amazing to watch him explain that they don't look at the mouse while moving it, they look at the pointer).
the critical path for fast typing is the precision with which you can synchronize the motions of different fingers, and in particular different hands; if keydown events happen in the wrong order, you start to get tranpsosition erorrs
chording keyboards don't care in which order the keys in the chord start; they only care what the set of keys in the chord is and when it ends (so they can stop looking for new keys to add to the set). fast typists on chording keyboards can do 300 words per minute, which is about 7 chords per second, which is about as many "strokes" as a normal typist on a non-chording keyboard, just with chords (producing a syllable each) instead of individual keys (producing a letter each)
adding a required modifier key that has to happen before a keystroke, and end before the next keystroke, is the opposite; it adds more things you have to sequence correctly. in the half-keyboard patch case, in particular, it adds 50% more things. this slows down your typing by about a third
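the arithmetic behind those figures, as a small python sketch. two assumptions of mine, not stated above: the typist is limited to a fixed number of correctly sequenced events per second, and on a mirrored half keyboard about half of all letters need the modifier.

```python
# Chording: 300 words per minute at about 7 chords per second.
words_per_second = 300 / 60                 # 5 words per second
chords_per_word = 7 / words_per_second      # roughly 1.4 syllables per word

# Mirrored half keyboard (assumption: half the letters need the modifier,
# so each letter costs 1.5 sequenced events on average instead of 1).
events_per_letter = 1 + 0.5
relative_speed = 1 / events_per_letter      # fraction of the original speed
slowdown = 1 - relative_speed               # fraction of speed lost

print(f"chords per word: {chords_per_word:.1f}")
print(f"relative speed with modifier: {relative_speed:.2f}")
print(f"slowdown: {slowdown:.2f}")
```

so 50% more events at the same event rate works out to about two thirds of the original speed, which is where "about a third" comes from.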
"reporterette" is antique, but appeared in a NYTimes headline as late as 2018 - the author reflected on her career, including sexist epithets. https://www.nytimes.com/2018/12/02/opinion/george-hw-bush-ma...