Hacker News new | past | comments | ask | show | jobs | submit | siraben's comments login

Some of the text seems to be cut off, but there is a docx version that loads fine with LibreOffice.[0] I am hosting a PDF export of that too.[1]

[0] https://downloads.reactivemicro.com/Electronics/Reverse%20En...

[1] https://cloud.siraben.dev/s/z9GTFfjDDgGXHSQ


Smooth experience! I loved the details such as the assistant getting a bit annoyed when you go to the vending machine for a drink or “I regret to inform you…” when you try to use the internet terminal on board.


What's the word error rate? Is it the same as the distilled whisper models?


Yes, most Chinese characters are phono-semantic compounds.[0] However, these compounds make the most sense in the phonology of Old Chinese,[1] the stage of the language in which the characters took shape.

For example,

偒 was pronounced /*l̥ʰaːŋʔ/ and 陽 was pronounced /*laŋ/, but the modern pronunciations are tǎng (/tʰɑŋ²¹⁴/) and yáng (/jɑŋ³⁵/) respectively. So the phonetic part 昜 /*laŋ/ no longer consistently represents that sound, although in this case the final -aŋ is still present.

And as for sounds that were present in Old Chinese but not in Middle Chinese and Mandarin, like [2] 巽 was pronounced /*sqʰuːns/, now xùn (/ɕyn⁵¹/), they underwent a series of regular sound shifts that make them sound quite different when used in characters in Mandarin.

Also, Old Chinese was not a tonal language; tones first appeared in Middle Chinese, from which the modern system derives (with changes). Tones never had a chance to appear in the writing.

[0] https://en.wikipedia.org/wiki/Chinese_character_classificati...

[1] https://en.wikipedia.org/wiki/Old_Chinese

[2] https://en.wiktionary.org/wiki/%E5%B7%BD


Why not translate your code to pointfree style automatically? Using[0], you can go from

  quad a b c = let d = b * b - 4 * a * c in ((-b + sqrt d) / 2 * a, (-b - sqrt d) / 2 * a)
to

  ghci> import Control.Monad
  ghci> quad = ap (ap . ((.) .) . ap (ap . (liftM2 (,) .) . flip (flip . ((*) .) . flip flip 2 . ((/) .) . (. sqrt) . (+) . negate)) (flip (flip . ((*) .) . flip flip 2 . ((/) .) . (. sqrt) . (-) . negate))) (flip ((.) . (-) . join (*)) . (*) . (4 *))
  ghci> quad 1 3 (-4)
  (1.0,-4.0)
[0] https://pointfree.io/
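For the skeptical, the two definitions can be spot-checked against each other; here is a sketch with the point-free output copied verbatim from above (the added type signature just pins the otherwise-ambiguous numeric type):

```haskell
import Control.Monad (ap, join, liftM2)

-- the pointful original
quad :: Double -> Double -> Double -> (Double, Double)
quad a b c =
  let d = b * b - 4 * a * c
  in ((-b + sqrt d) / 2 * a, (-b - sqrt d) / 2 * a)

-- the pointfree.io output, copied verbatim
quadPF :: Double -> Double -> Double -> (Double, Double)
quadPF = ap (ap . ((.) .) . ap (ap . (liftM2 (,) .) . flip (flip . ((*) .) . flip flip 2 . ((/) .) . (. sqrt) . (+) . negate)) (flip (flip . ((*) .) . flip flip 2 . ((/) .) . (. sqrt) . (-) . negate))) (flip ((.) . (-) . join (*)) . (*) . (4 *))
```

Loading this in ghci, `quadPF 1 3 (-4)` gives the same `(1.0,-4.0)` as the pointful version.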


Apart from "just because we can" or "it's fun", why on earth would someone prefer the second style?


In this example I can’t imagine anyone preferring the second style, but there are cases where it’s nicer. For example, compare the tacit:

    foo = h . g . f
With the more verbose:

    foo x =
      let
        a = f x
        b = g a
        c = h b
      in c
If a, b, and c have useful names that help you understand the code, then the second function might be preferable, but in a lot of cases the intermediate variables are just adding noise. The tacit example makes it very clear at a quick glance exactly what’s happening.

My personal rule of thumb is that if you are passing combinators in as arguments to other combinators then you should probably stop, but straightforward chaining is usually okay.
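To illustrate that rule of thumb with two hypothetical examples: a straightforward composition chain reads fine tacitly, while combinators fed to combinators (here, the reader-monad trick for a mean) is where it stops paying off.

```haskell
import Control.Monad (liftM2)
import Data.Char (toLower)

-- straightforward chaining: easy to read right-to-left
slug :: String -> String
slug = map toLower . filter (/= ' ')

-- combinators as arguments to combinators: probably time to stop.
-- liftM2 in the reader monad means: \xs -> sum xs / fromIntegral (length xs)
mean :: [Double] -> Double
mean = liftM2 (/) sum (fromIntegral . length)
```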


It's a joke.


Tacit programming means you don't use argument names to direct your data to the desired output. What's interesting to me about that are the unexplored possibilities of how data could be directed without names.


Clearly superior.


Probably protects you against a decent amount of tracking, but there are numerous markers that don’t even rely on cookies and, when used together, can be correlated back to your profile: for instance, IP address, user agent/OS string, timezone, hardware capabilities exposed via the browser, etc.


Interested to see how it performs for Mandarin Chinese speech synthesis, especially with prosody and emotion. The highest quality open source model I've seen so far is EmotiVoice[0], which I've made a CLI wrapper around to generate audio for flashcards.[1] For EmotiVoice, you can apparently also clone your own voice with a GPU, but I have not tested this.[2]

[0] https://github.com/netease-youdao/EmotiVoice

[1] https://github.com/siraben/emotivoice-cli

[2] https://github.com/netease-youdao/EmotiVoice/wiki/Voice-Clon...


Hi, WhisperSpeech dev here. We only support Polish and English at the moment, but we just finished some inference optimizations and are looking to add more languages.

What we seem to need are high-quality speech recordings in any language (audiobooks are great), plus some recordings for each target language which can be low-quality but need varied prosody/emotions (otherwise everything we generate will sound like an audiobook).


Last I checked, LibriVox had about 11 hours of Mandarin audiobooks and Common Voice has 234 validated hours of "Chinese (China)" (probably corresponding to Mandarin as spoken on the mainland paired with text in Simplified characters, but who knows) and 77 validated hours of "Chinese (Taiwan)" (probably Taiwanese Mandarin paired with Traditional characters).

Not sure whether that's enough data for you. (If you need paired text for the LibriVox audiobooks, I can provide you with versions where I "fixed" the original text to match the audiobook content e.g. when someone skipped a line.)


LibriVox seems like a great source, being public domain, though the quality is highly variable.

I can recommend Elizabeth Klett as a good narrator. I've sampled her recordings of the Jane Austen books Emma, Pride and Prejudice, and Sense and Sensibility.


For Polish I have around 700 hours. I suspect that we will need fewer hours per language if we add more languages, since they do overlap to some extent.

Fixed transcripts would be nice although we need to align them with the audio really precisely (we cut the audio into 30 second chunks and we pretty much need to have the exact text in every chunk). It seems this can be solved with forced alignment algorithms but I have not dived into that yet.
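The chunking step, at least, is mechanical. A toy sketch (hypothetical types, assuming each force-aligned segment is tagged with a start time in seconds; real chunking would also have to handle segments that straddle a boundary):

```haskell
import Data.List (groupBy)
import Data.Function (on)

-- Group aligned (start-time, text) segments into 30-second buckets,
-- keyed by which 30-second window the segment starts in.
chunk30 :: [(Double, String)] -> [[(Double, String)]]
chunk30 = groupBy ((==) `on` bucket)
  where
    bucket :: (Double, String) -> Int
    bucket (t, _) = floor (t / 30)
```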


I have forced alignments, too.

E.g. for The True Story of Ah Q: https://github.com/Yorwba/LiteratureForEyesAndEars/tree/mast... The .align.json is my homegrown alignment format, .srt are standard subtitles, and .txt is the text, but note that in some places I have [[original text||what it is pronounced as]] annotations to make the forced alignment work better. (E.g. the "." in LibriVox.org, pronounced as 點 "diǎn" in Mandarin.) Oh, and cmn-Hans is the same thing transliterated into Simplified Chinese.

The corresponding LibriVox URL is predictably https://librivox.org/the-true-story-of-ah-q-by-xun-lu/


Thanks, I'll check it out. I don't know any Chinese so I'll probably reach out to you for some help :)


Sure, feel free to email me at the address included in every commit in the repo.


You might check out this list from espnet. They list the different corpuses they use to train their models sorted by language and task (ASR, TTS etc):

https://github.com/espnet/espnet/blob/master/egs2/README.md


Just listened to the demo voices for EmotiVoice and WhisperSpeech. I think WhisperSpeech edges out EmotiVoice. EmotiVoice sounds like it was trained on English spoken by non-native speakers.


Did you try XTTS v2 for Mandarin? I'm curious how it compares with EmotiVoice.


It has a big problem with hallucination in Chinese, random extra syllables all over the place.


Makes sense, I get hallucinations in English too.


Have you released your flashcard app?


If you're interested, I have a small side project (https://imaginanki.com) for generating Anki decks with images + speech (via SDXL/Azure).


Some language-learning resources, from "Show HN: Open-source tool for creating courses like Duolingo" (2023) https://news.ycombinator.com/item?id=38317345 :

> ENH: Generate Anki decks with {IPA symbols, Greek letters w/ LaTeX for math and science,


Not OP, but I develop Mochi [0] which is a spaced repetition flash card app that has text-to-speech and a bunch of other stuff built in (transcription, dictionaries, etc.) that you might be interested in.

[0] https://mochi.cards


What spaced repetition algorithm does it use?


It's just an Anki deck.


I’ve been using FSRS for 3 months and it’s finally resolved some of my pain points about having to trial-and-error adjust the old SM2 scheduling algorithm, since the content of each deck can greatly affect what the optimal retention is. Now you can just retrain the weights for each deck you have every few months and it will adapt appropriately. The paper[0] is also definitely worth reading if you want to see some rigorous analysis of large-scale real-world spaced repetition science.

Because of the extensive benchmarking, most people probably will not benefit from refitting the weights to their own collection until they have thousands of reviews (the author recommends 1k+).

Note it still works fine even if you review your cards late, since the recall probabilities are based on the stability and when you last reviewed the card, and the stability gets a bigger update if you somehow managed to still recall a card after its due date.
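For context, the core of FSRS is a power-law forgetting curve. A minimal sketch of the FSRS v4 form (later versions adjust the constants), where t is days elapsed and s is the memory stability in days:

```haskell
-- Probability of recall after t days, given stability s (days).
-- Calibrated so that recall s s == 0.9: stability is the interval
-- at which predicted retention has decayed to 90%.
recall :: Double -> Double -> Double
recall t s = (1 + t / (9 * s)) ** (-1)
```

This is why being late is harmless: the model just evaluates the curve at the actual elapsed t rather than the scheduled interval.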

[0] https://dl.acm.org/doi/10.1145/3534678.3539081?cid=996605471...

[1] https://github.com/open-spaced-repetition/fsrs4anki/wiki/The...


In the limit, you can get coffee routines like this: [0]

[0] https://www.youtube.com/watch?v=Gst_NYxAg9s


"...our very expensive grinder." To save you the effort: that's about $4300 of coffee grinder, in that video. (Weber EG-1, Black. White is $400 "cheaper.")


That routine is robber baron extravagant.


Is there a high quality speech synthesizer (ideally local) for Mandarin you have found? There are some subtleties with tone sandhi rules and how they interact with prosody that I feel are lacking with current TTS voices I’ve tried.


I love the idea of LLMs being super-efficient language tutors. And you have a good point; coming soon: "We've been getting a lot of these tourists here lately, they're eerily fluent, but all seem to have the same minor speech impediment" (read: messed-up weights in a commonly used speech model).


I've been using ChatGPT 4 to translate and explain various texts in Mandarin and it's been very on point (checking with native speakers from time to time, or with internet searches). As expected, it has trouble with slang and cross-language loanwords occasionally. However, for languages with much less information online, it hallucinates like crazy.

> coming soon: "We've been getting a lot of these tourists here lately, they're eerily fluent, but all seem to have the same minor speech impediment"

Haha, if that came to pass, it would still be a far better outcome than our current situation of completely blind machine translation (especially for various Asian languages that are very sensitive to phrasing) and mispronunciation by non-native speakers.


> all seem to have the same minor speech impediment

Ah, that is called an accent.


Kind of. Accents are typically derived from the intersection of natural languages, specifically which ones you learned the phonetics of first (with the exception of the Mid-Atlantic accent...).

This would be something quite novel, as the speech irregularities would not have their origin in people.

I don't know what you would call it, but it needs at least some adjective before "accent" to differentiate it, IMO.


The first one I plan to try is https://github.com/netease-youdao/EmotiVoice

I don't have the expertise to judge the quality of Mandarin pronunciation myself, being a beginner. But it sounds OK in English and it's made by native Mandarin speakers in China so I expect that it sounds better in Mandarin than English.


Sounds pretty good, although still lacking in natural-sounding tone sandhi (e.g. try 一下, it should be yi2xia4 instead of yi1xia4).
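The sandhi rule for 一 itself is regular enough to write down. A toy sketch (tones encoded as integers 1-4, ignoring the neutral tone and the cases where 一 keeps first tone, e.g. counting and ordinals):

```haskell
-- Tone sandhi for 一 (yi): first tone in isolation,
-- second tone before a fourth-tone syllable (一下 yi2 xia4),
-- fourth tone before tones 1-3 (一天 yi4 tian1).
yiTone :: Maybe Int -> Int
yiTone Nothing = 1
yiTone (Just next)
  | next == 4 = 2
  | otherwise = 4
```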


Do you have a favorite Chinese learning app ?


the azure neural tts voices in chinese are the best i’ve heard, specifically the “xiaochen” voice. i use it in anki daily to generate sentences for my mandarin decks with an api key/plugin. it’s not something you run locally of course, but they have a decent enough free tier.

i’m hoping a voice as realistic as this becomes a local app soon, but i’ve not found anything that’s nearly as natural sounding yet. (also, honorable mention to chatgpt’s “sky.” she pronounces mandarin with a funnily american accent, but it sounds natural and not as robotic as the open-source alternatives i’ve tried)

