technology

>> No. 24247 Anonymous
3rd June 2015
Wednesday 4:19 pm
How to extract links from a page?

What I've been doing is downloading the source code and appending a little script to it that uses getElementsByTagName to extract them and then write them to the body.

I know this is retarded but I don't know how else to do it. Is there a way to run JS on the fly on a web page or something?
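For what it's worth, here is a rough sketch of the same getElementsByTagName idea written as a function that could be pasted into a browser console on the live page, rather than appended to a saved copy of the source. The name extractLinks is made up for illustration, and it assumes whatever is passed in behaves like a browser document.

```javascript
// Collect the href of every <a> element in a document-like object.
// `doc` is assumed to expose getElementsByTagName, as browser documents do.
function extractLinks(doc) {
  const anchors = doc.getElementsByTagName('a');
  const links = [];
  for (let i = 0; i < anchors.length; i++) {
    const href = anchors[i].href;
    if (href) links.push(href); // skip anchors that have no href
  }
  return links;
}

// In a browser console it would be called on the live page, e.g.:
//   console.log(extractLinks(document).join('\n'));
```

Printing to the console instead of writing into the body leaves the page itself untouched.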
>> No. 24248 Anonymous
3rd June 2015
Wednesday 4:20 pm
A bookmarklet?
>> No. 24249 Anonymous
3rd June 2015
Wednesday 4:24 pm
>>24248

That looks like what I need, thanks.
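A bookmarklet is just a bookmark whose address is a javascript: URL, so clicking it runs the code on whatever page is open. A minimal sketch of building one follows; toBookmarklet is a hypothetical helper name, not a browser API, and the snippet it wraps is only an example.

```javascript
// Wrap a snippet of source in a javascript: URL so it can be saved as a bookmark.
// toBookmarklet is a hypothetical helper name, not a browser API.
function toBookmarklet(source) {
  // The function wrapper keeps variables out of the page's global scope;
  // the trailing void 0 stops the browser replacing the page with a result.
  return 'javascript:' + encodeURIComponent('(function(){' + source + '})();void 0;');
}

// Example snippet: gather every link on the page and show them in an alert.
const snippet =
  "var a=document.getElementsByTagName('a'),out=[];" +
  "for(var i=0;i<a.length;i++){if(a[i].href)out.push(a[i].href);}" +
  "alert(out.join('\\n'));";

console.log(toBookmarklet(snippet));
```

The printed string is what would go in the bookmark's address field.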
>> No. 24250 Anonymous
3rd June 2015
Wednesday 4:28 pm
Is Zotero what you want?
>> No. 24251 Anonymous
3rd June 2015
Wednesday 4:30 pm
>>24250

nah
>> No. 24258 Anonymous
3rd June 2015
Wednesday 10:13 pm
I've been using an old piece of freeware that works well most of the time, though not necessarily on the mangled HTML that Microsoft Word spits out, for example.

http://web.archive.org/web/20140722080716/http://www.focalmedia.net/urlextract.html
>> No. 24269 Anonymous
8th June 2015
Monday 11:49 pm
>>24247

OP here. I just discovered I can do what I want with Firebug.

This is the code I wrote to extract the links in a thread on cripplechan. Is it good code? I've been teaching myself for about a year.

http://pastebin.com/XJa7JUfd
>> No. 24270 Anonymous
9th June 2015
Tuesday 12:21 am
>>24258
I remember someone complaining that they couldn't paste text from Word into a Web form. It turned out the text was automatically being rendered as rich text, but Word had inserted so much obnoxious XML that it had maxed out the underlying field before even reaching the first printable character.
>> No. 24271 Anonymous
9th June 2015
Tuesday 2:02 am
>>24270

Are there still browsers without "paste as plain text" in the right-click context menu?
>> No. 24273 Anonymous
9th June 2015
Tuesday 6:11 am
>>24271

I've never noticed that before. It was one of the context menu options my eyes just automatically ignore.
>> No. 24274 Anonymous
9th June 2015
Tuesday 9:23 am
I stumbled across this yesterday.

It's in Python, so probably not what you want.

https://github.com/jabbalaci/Bash-Utils/blob/master/get_links.py
>> No. 24275 Anonymous
9th June 2015
Tuesday 9:35 am
>>24271
By all means continue to blame the victim.
>> No. 24276 Anonymous
9th June 2015
Tuesday 9:37 am
>>24275
You chose to use the device you are using, and any attempt to deny that is a denial of free will.
>> No. 24277 Anonymous
9th June 2015
Tuesday 9:42 am
>>24276
No, corporate IT chose for them to use that browser, but go on blaming the victim if it makes you feel better, you rapist.
>> No. 24287 Anonymous
11th June 2015
Thursday 7:29 pm
>>24269

Not bad. Try using regexes.
>> No. 24288 Anonymous
11th June 2015
Thursday 8:58 pm
>>24287
Now he has two problems.
>> No. 24289 Anonymous
11th June 2015
Thursday 9:18 pm
htmlregex.png
>>24287
No. Just ... no.
>> No. 24290 Anonymous
11th June 2015
Thursday 10:17 pm
>>24269
endsWith('png.png', 'png') === false
>> No. 24291 Anonymous
12th June 2015
Friday 12:15 pm
>>24290

You need the dot in the extension.

endsWith('png.png', '.png') === true
>> No. 24292 Anonymous
12th June 2015
Friday 12:26 pm
>>24289

I'm suggesting that he use regexes to filter the hrefs, not to parse the HTML. Though regexes could also be used to search through the source code as text.
>> No. 24293 Anonymous
12th June 2015
Friday 12:32 pm
>>24292

/\.(jpg|jpeg|png|gif|bmp|mp4|webm)$/i

I think that should do it
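A quick sketch of how that pattern might be used to filter a list of hrefs that have already been pulled out of the page. The name filterMediaLinks and the sample URLs are made up for illustration.

```javascript
// The pattern from the post above: match a media extension at the end of the string.
const MEDIA_EXT = /\.(jpg|jpeg|png|gif|bmp|mp4|webm)$/i;

// filterMediaLinks is a hypothetical helper name.
function filterMediaLinks(hrefs) {
  return hrefs.filter(function (href) {
    return MEDIA_EXT.test(href);
  });
}

// Made-up sample data.
const sample = [
  'http://example.com/a.PNG',
  'http://example.com/page.html',
  'http://example.com/clip.webm',
  'http://example.com/b.jpg?size=large'
];

console.log(filterMediaLinks(sample));
// keeps a.PNG and clip.webm
```

Because of the $ anchor, a link with a query string after the extension (the last sample) is not matched, which may or may not be what is wanted.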
>> No. 24297 Anonymous
12th June 2015
Friday 12:42 pm
>>24293

/\.(jp(e?)g|png|gif|bmp|mp4|webm)$/i

???
>> No. 24298 Anonymous
12th June 2015
Friday 12:42 pm
>>24297

Finished?
>> No. 24299 Anonymous
12th June 2015
Friday 12:58 pm
>>24292

Search through the source code with this:

/http(s?):\/\/(.+)\.(jp(e?)g|png|gif|bmp|mp4|webm)/i

I think.
>> No. 24300 Anonymous
12th June 2015
Friday 4:52 pm
>>24299

/http(s?):\/\/(\S+)\.(jp(e?)g|png|gif|bmp|mp4|webm)/i

...otherwise I think it'll match from the beginning of the first link all the way to the end of the last link. But then this is probably still hopelessly broken too - I'm shit at coding so I thought I'd try and help
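To see the difference between the two patterns, here is a small sketch run against a made-up line of HTML. The g flag is added here (it is not in the posts above) so that match returns every hit rather than just the first.

```javascript
// Made-up sample source: two links on one line.
const source =
  '<a href="http://example.com/one.png">one</a> ' +
  '<a href="http://example.com/two.gif">two</a>';

// Greedy .+ version: runs on past the first link as far as it can.
const greedy = /http(s?):\/\/(.+)\.(jp(e?)g|png|gif|bmp|mp4|webm)/gi;

// \S+ version: cannot cross whitespace, so each link is matched separately here.
const nonSpace = /http(s?):\/\/(\S+)\.(jp(e?)g|png|gif|bmp|mp4|webm)/gi;

console.log(source.match(greedy));   // one long match spanning both links
console.log(source.match(nonSpace)); // two matches, one per link
```

The \S+ version only helps when links are separated by whitespace; two URLs packed together with no space between them would still be swallowed as one match.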
>> No. 24302 Anonymous
12th June 2015
Friday 5:12 pm
>>24291
You need to think harder.

endsWith('gabrielle.giffords.gif', '.gif') === false
>> No. 24304 Anonymous
12th June 2015
Friday 5:33 pm
>>24302

I don't code a lot of javascript, thank christ, but 'return text.indexOf(substr) === text.length - substr.length;' seems like an incredibly barse-ackwards way of checking the end of a string.

Something like 'return text.slice(-substr.length) === substr' seems a lot more logical.
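Putting the two one-liners from the post above side by side makes the difference easy to check. The function names and the (text, substr) parameter order are guesses based on the quoted line, not the actual pastebin code.

```javascript
// The indexOf-based check, reconstructed from the line quoted above.
function endsWithIndexOf(text, substr) {
  return text.indexOf(substr) === text.length - substr.length;
}

// The slice-based alternative suggested above.
function endsWithSlice(text, substr) {
  return text.slice(-substr.length) === substr;
}

console.log(endsWithIndexOf('gabrielle.giffords.gif', '.gif')); // false
console.log(endsWithSlice('gabrielle.giffords.gif', '.gif'));   // true
console.log(endsWithIndexOf('png.png', 'png'));                 // false
console.log(endsWithSlice('png.png', 'png'));                   // true
```

One thing to watch with the slice version: an empty substr makes slice(-0) return the whole string, so it reports false for any non-empty text. Browsers that support ES2015 also have a built-in String.prototype.endsWith.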
>> No. 24305 Anonymous
12th June 2015
Friday 5:57 pm
I'll write something in bash for you if i cba, later on.
>> No. 24307 Anonymous
12th June 2015
Friday 6:12 pm
>>24277
Corporate IT plays a significant role in choosing a company's browser, sure. When you interviewed for your current company you had the chance to learn what the IT department were like and you chose to work there anyway.

Corporate IT often gets away with a lot of shit because people complain among themselves and not to the right people. That's probably true of life and not just corporate IT. People whine and bitch that people in power are fucking them, but so often they've never expressed their lack of consent.
>> No. 24308 Anonymous
12th June 2015
Friday 6:21 pm
>>24307
You're doing really well, rapistlad. Keep on digging. Them victims wont no wot hit em.
>> No. 24310 Anonymous
12th June 2015
Friday 8:47 pm
>>24302

Oh, I see now. indexOf finds the first occurrence. Like I said, I'm still learning.
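The positions involved can be printed directly. This is only an illustration using the filename from earlier in the thread; the variable names are made up.

```javascript
const filename = 'gabrielle.giffords.gif';
const ext = '.gif';

// indexOf reports the first '.gif', which is the start of '.giffords'.
const first = filename.indexOf(ext);
// lastIndexOf reports the final '.gif', the actual extension.
const last = filename.lastIndexOf(ext);
// Where a true suffix has to start.
const expected = filename.length - ext.length;

console.log(first, last, expected); // 9 18 18
```

Swapping in lastIndexOf fixes this particular case, but either comparison can still wrongly report true when the suffix is exactly one character longer than the text and does not occur in it, since -1 then equals the length difference.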
